Impact of Big Data on Computational Intelligence ...

Impact of Big Data on Computational Intelligence Aspects of Cyber Security and the Computing Environment to Support Repeatable Scientific Experimentation

IEEE Symposium on Computational Intelligence in Cyber Security (CICS 2015) Keynote, IEEE Symposium Series on Computational Intelligence (SSCI 2015)
December 8-10, 2015, Cape Town, South Africa

Presented by: Robert K. Abercrombie
University of Memphis | Center for Information Assurance | FedEx Institute of Technology

ORNL is managed by UT-Battelle for the US Department of Energy

Agenda For Today’s Presentation
• Impact of Big Data
– The Changing Nature of our World and Research
– Data Explosion
– Business of Big Data
• Computational Intelligence Aspects of Cyber Security
• A Computing Environment in Cyberspace to Support Repeatable Scientific Experimentation
– Procedure for a Computing Environment to Support Repeatable Scientific Big Data Experimentation Motivation
– Computing Environment and its Capabilities
– Analysis of Technical Requirements and Alternatives
• Conclusions and Future Directions

IEEE SSCI 2015 / CICS 2015 Keynote December 8-10, 2015 – Cape Town, South Africa

The Changing Nature of Our World

[Figure: DAY and NIGHT panels]

B. Endicott-Popovsky, “A Probability of 1,” Journal of Cyber Security and Information Systems, Vol. III, No. 1, Issue: Applying Modeling & Simulation for Defense, March 2015, pp. 18-19.


The Changing Nature Of Research

$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$

• Thousand Years Ago: description of natural phenomena
• Last Few Hundred Years: Newton’s laws, Maxwell’s equations…
• Last Few Decades: simulation of complex phenomena
• Today and the Future: unify theory, experiment, and simulation with large multidisciplinary data
– Using data exploration and data mining (from instruments, sensors, humans…)
– Distributed communities

Caution to the Fourth Paradigm
“There are three kinds of lies: lies, damned lies, and statistics.”
“Every generalization is false, including this one.”
- Mark Twain
The thrill of human scientific discovery must not be lost to computerized methods and data-intensive scientific research.

Big Data = Volume, Variety and Velocity


The Data Explosion

Information Technology

Deliver the capability to mine, search, and analyze this data in near real time


Science itself is evolving

Volumes and Rates
• Published Papers/Patents at ORNL: 7 TB
• Library of Congress Text: 20 TB
• Amazon: 42 TB
• ChoicePoint: 250 TB
• ORNL Scientometrics Cloud: 260 TB
• AT&T: 323 TB
• US Government: 848 TB ('09)
• US Discrete Manufacturing Companies: 966 TB ('09)

Volumes and Rates
• Twitter Updates: 400 M/d
• Facebook
– Likes/Comments: 2.7 B/d
– Shared Contents: 30 B/m
• World Emails: 419 B/d
• YouTube
– Storage: 76 PB/yr.
– Traffic: 16.2 EB/yr.
• World Social Media: 1.8 ZB (x2 every 2 yrs.)
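The "x2 every 2 yrs." figure on the slide above implies exponential growth. A minimal sketch of that arithmetic follows, assuming the 1.8 ZB value as the starting point and treating the year offsets as relative, since the slide does not state a base year. The function name and parameters are illustrative only.

```python
def projected_volume_zb(initial_zb: float, years: float,
                        doubling_period_years: float = 2.0) -> float:
    """Project a data volume that doubles every `doubling_period_years`."""
    return initial_zb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Relative offsets only: the slide gives no base year for the 1.8 ZB figure.
    for offset in (0, 2, 4, 6):
        print(f"+{offset} yrs: {projected_volume_zb(1.8, offset):.1f} ZB")
```

Under this assumption, the volume grows eightfold over six years, which is the kind of growth the "Data Explosion" slides describe.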

Volumes and Rates
• 2.5 m Telescope: 200 GB/d
• Ion Mobility Spectroscopy: 10 TB/d
• X-ray Photon Correlation Spectroscopy, 3D X-ray Diffraction Microscopy: 24 TB/d
• Boeing 737 cross-country flight: 240 TB
• Personal Location Data: 1 PB/yr.
• Astrophysics Data: 10 PB (2014)
• Square Kilometer Array: 480 PB/d

The Business of Big Data • $300 billion annual value of big data for the U.S. health care system, two-thirds of which would come in reduced expenditures*.

• $165 billion worth of value for big clinical data (McKinsey).
• 966 petabytes data stored by discrete manufacturing companies in the U.S. during 2009; 848 petabytes of data stored by government in the same year (McKinsey).
• By 2020, IT departments will have 10 times more servers and 50 times more data to look after than they do now.

The Business of Big Data (cont.)
• The U.S. will face shortages* of:
– between 140,000 and 190,000 individuals with “deep analytical skills” capable of working with very large data sets;
– between 300,000 and 400,000 skilled technicians and support staff;
– about 1.5 million “data-savvy” managers and analysts.

* http://www.mckinsey.com/features/big_data

The Business of Big Data (cont.)

http://www.internetworldstats.com/stats.htm

Cyberspace* and Information Security Fundamentals

Background – in the context of Computational Intelligence Aspects of Cyber Security
The term cyberspace usually refers to the worldwide collection of connected ICT components, the information that is stored in and flows through those components, and the ways that information is structured and processed.

Concept of Cybersecurity
• The act of protecting information and communications technology (ICT) systems and their contents has come to be known as cybersecurity.
• A broad and arguably somewhat fuzzy concept, cybersecurity can be a useful term but tends to defy precise definition. It usually refers to one or more of three things:
– A set of activities and other measures intended to protect (from attack, disruption, or other threats) computers, computer networks, related hardware and devices software, and the information they contain and communicate, including software and data, as well as other elements of cyberspace.
– The state or quality of being protected from such threats.
– The broad field of endeavor aimed at implementing and improving those activities and quality.

Definition – Confidentiality, Integrity, & Availability
• 44 U.S. Code 3502 – defines Information Security as a means of protecting information […] from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide:
– confidentiality, which means preserving authorized restrictions on access and disclosure, including means for protecting personal privacy and proprietary information;
– integrity, which means guarding against improper information modification or destruction, and includes ensuring information nonrepudiation and authenticity; and
– availability, which means ensuring timely and reliable access to and use of information.
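As one concrete illustration of the integrity property defined above, the following minimal sketch uses a keyed hash (HMAC-SHA256 from the Python standard library) to detect improper modification of a message and to tie it to a holder of a shared key. It is an assumption-laden example, not part of the original presentation: the key and messages are hypothetical, and a shared-key MAC does not by itself provide nonrepudiation, confidentiality, or availability.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for `message` under the shared `key`."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time; False means the message or tag was altered."""
    return hmac.compare_digest(sign(message, key), tag)


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"      # illustrative only
    record = b"blood type: O negative"       # illustrative only
    tag = sign(record, key)
    print(verify(record, key, tag))                      # unmodified record
    print(verify(b"blood type: AB positive", key, tag))  # tampered record
```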

The Threat Space in Cyber Space
• National intellectual property is being stolen at alarming rates
• National assets are vulnerable to attack and exploitation
• Personally Identifiable Information is at risk
• Competing and difficult national priorities for resources

[Figure labels: Electric Power, Water, Emergency, Oil & Gas, Transportation, Communications, Financial]

Attack Surface Categories
• Network Attack Surface
– Vulnerabilities over an enterprise network, wide-area network, or the Internet
– Network protocol vulnerabilities, e.g., used for a DoS attack, disruption of communications links, and various intrusions
• Software Attack Surface
– Vulnerabilities in application, utility, or operating system code
– Particular focus is Web server software
• Human Attack Surface
– Vulnerabilities created by personnel or outsiders, such as social engineering, human error, and trusted insiders

Every Computer Security Strategy Consists of:


Cyber Security Aspects of Interests


Who “Really” Are the Threat Actors?
• Over 90% of threat actors are external to an organization
• 55% of the actors associated with organized crime
– Predominantly in U.S. and Eastern Europe
• ~20% of actors associated with nation-state operations
– Over 90% attributable to China
• Internal actors: large percentage of events tied to unintentional misconfigurations
• From 2015 Data Breach Investigations Report
– From 2010 – 2014, virtually no change in overall proportion attributed to external, internal, and partner actors.
Source: http://www.verizonenterprise.com/DBIR/2013 and 2015

Federal Agency Cybersecurity Roles

https://www.fas.org/sgp/crs/misc/R43831.pdf - Eric A. Fischer, “Cybersecurity Issues and Challenges: In Brief,” December 16, 2014, Congressional Research Service, Report 7-5700, R43831, www.crs.gov.

Near-term Versus Long-term Needs in Cybersecurity
• The executive-branch actions and proposed legislation are largely designed to address several well-established near-term needs in cybersecurity:
– preventing cyber-based disasters and espionage,
– reducing impacts of successful attacks,
– improving inter- and intra-sector collaboration,
– clarifying federal agency roles and responsibilities, and
– fighting cybercrime.
• However, those needs exist in the context of more difficult long-term challenges relating to:
– design, incentives, consensus, and environment (DICE)
https://www.fas.org/sgp/crs/misc/R43831.pdf - Eric A. Fischer, “Cybersecurity Issues and Challenges: In Brief,” December 16, 2014, Congressional Research Service, Report 7-5700, R43831, www.crs.gov.

Long-term Cybersecurity challenges – Design, Incentives, Consensus, Environment – “DICE”
• Design: Experts often say that effective security needs to be an integral part of ICT design.
– Yet, developers have traditionally focused more on features than security, for economic reasons.
– Also, many future security needs cannot be predicted, posing a difficult challenge for designers.
• Incentives: The structure of economic incentives for cybersecurity has been called distorted or even perverse.
– Cybercrime is regarded as cheap, profitable, and comparatively safe for the criminals.
– In contrast, cybersecurity can be expensive, is by its nature imperfect, and the economic returns on investments are often unsure.

Long-term Cybersecurity challenges – Design, Incentives, Consensus, Environment – “DICE” (cont.)
• Consensus: Cybersecurity means different things to different stakeholders, with little common agreement on meaning, implementation, and risks.
– Substantial cultural impediments to consensus also exist, not only between sectors but within sectors and even within organizations.
• Environment: Cyberspace has been called the fastest evolving technology space in human history, both in scale and properties.
– New and emerging properties and applications, especially social media, mobile computing, big data, cloud computing, and the Internet of things, further complicate the evolving threat environment, but they can also pose potential opportunities for improving cybersecurity, for example through the economies of scale provided by cloud computing and big data analytics.

Current Events – December 1, 2015*

• On average, companies that got breached did not know it for 270 days, and some had even been breached for seven years without knowing it, according to Richard Clarke, the former White House cybersecurity czar who served three presidents.
• Clarke explained that:
– two-thirds of those entities did not even discover the breach internally;
– it was pointed out to them, either by someone outside the organization or by the federal government.

* http://www.healthcareitnews.com/news/7-cyber-threats-other-phi-or-pii-breaches - Opening keynote “Cybersecurity in 2015: From Theft to Destruction,” Healthcare IT News Privacy and Security Forum

Current Events – December 1, 2015*
• As bad as breaches are, however, Clarke offered seven other cyber threats:
– Ransomware. Calling this an epidemic, Clarke explained that he frequently receives calls from clients who have been subject to someone essentially seizing their data and demanding money to give it back.
– DDoS. Distributed Denial of Service attacks, previously thought to be a minor problem, have reemerged with high-profile attacks against American banks, Clarke said. "DDoS is now, again, a threat. It's something you can send down the wire to an entity and knock it offline."
– Wiper attacks. "Think Sony or Saudi Aramco," Clarke said. Aramco had 30,000 end points, for instance, until one morning employees came in to work and found that all the software had been wiped out in a 7-minute attack. At Sony, in the days after the attack, guards couldn't look up Clarke's name to check him in because all the devices were wiped blank.

* http://www.healthcareitnews.com/news/7-cyber-threats-other-phi-or-pii-breaches - Opening keynote “Cybersecurity in 2015: From Theft to Destruction,” Healthcare IT News Privacy and Security Forum

Current Events – December 1, 2015*
– Intellectual property theft. IP theft is "probably the most damaging thing that happens," Clarke said. "If it's IP that's worth something and is online, it will be stolen."
– Straight theft of money. One increasingly common trick is that hackers assume the identity of someone in the comptroller's office who sends out wire transfers for accounts payable. They then wire relatively small amounts, say $100,000, to an offshore account, transfer it to another account elsewhere and it's gone.
– Data manipulation. Wall Street's greatest fear is not data being stolen but the potential for someone to manipulate the data so firms don't really know who owns what anymore. An example particular to healthcare? Hackers changing data about blood transfusions could be deadly.
– Data destruction. Devices can be physically destroyed by code. Clarke took part in the Aurora experiment at the Department of Energy's lab in Idaho. "We hacked into a simulated power grid, took control, gave it the wrong commands through software and destroyed a large electric power generator," Clarke said, adding that this is just one example, while many real-world devices can be destroyed by software.

* http://www.healthcareitnews.com/news/7-cyber-threats-other-phi-or-pii-breaches - Opening keynote “Cybersecurity in 2015: From Theft to Destruction,” Healthcare IT News Privacy and Security Forum

Computational Intelligence Aspects of Cyber Security


Computing Environment in Cyberspace* to Support Repeatable Scientific Experimentation

Background – in the context of Computational Intelligence Aspects of Cyber Security
The term cyberspace usually refers to the worldwide collection of connected ICT components, the information that is stored in and flows through those components, and the ways that information is structured and processed.

Background A principal tenet of the scientific method is that experiments must be repeatable. This tenet relies on ceteris paribus (i.e., all other things being equal).

• The scientific policy and research assessment community is:
– investigating methods and techniques to establish an environment where experiments can be repeated through the use of data management.
– This approach attempts to ensure the integrity of scientific findings and of the processes by which experimentation and analysis are conducted.

• Data Science is the study of the generalizable extraction of knowledge from data. From this definition, scientific development thus becomes the piecemeal process by which these items have been added, singly and in combination, to the ever growing stockpile that constitutes scientific technique and knowledge.

• Examples:
– Scientific literature analysis, or Scientometrics, is the study of measuring and analysing science, technology and innovation.
– Organizations, such as Thomson Reuters, have long used these analyses to identify the most influential papers or researchers in a field. Recently, Foresight and Understanding from Scientific Exposition (FUSE) has taken this further by mining millions of papers and patents in both English and Chinese.

Background (cont.)
• Scientometrics and its related research activities in today’s world make extensive use of digital research data.
– The data management of this digital research data is, in essence, the quintessential requirement for repeatable scientific experimentation.
• This term, digital research data, encompasses a wide variety of information stored in digital form including:
– experimental, observational, and simulation data, codes, software and algorithms, text, numeric information, images, video, audio, and associated metadata.
– It also encompasses information in a variety of different forms including raw, processed, and analysed data, and published and archived data.
• More specifically, research data are defined in regulation and directives:
– "Intangible property - Code of Federal Regulations 2 CFR 200.315," 2014, and
– further in United States Government Directives "2 CFR 215 - Uniform Administration Requirements for Grants and Agreements With Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations (OMB Circular A-110)," 2012.
• Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.

Repeatability – Principle of Scientific Method
• To leverage finite resources and adhere to the principle of the scientific method that all experiments must be repeatable:
– We, as a scientific community, must investigate ways to establish environments where experiments can be repeated.
– We can no longer merely allude to where the data come from; we must add rigor to the data collection and data management processes on which our analysis is based.

*http://www.ventanaresearch.com/blog/commentblog.aspx?id=3654

• Data management involves all stages of the digital data life cycle including capture, analysis, sharing, and preservation. The focus of this statement is the sharing and preservation of digital research data.
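One simple way to add rigor to the preservation stage described above is to record a checksum manifest of the data files used in an experiment and re-check it before the experiment is repeated. The sketch below is a hypothetical illustration using only the Python standard library; the function names and the choice of SHA-256 are assumptions and are not taken from the presentation or from FUSEnet.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(data_dir)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

An empty result from `changed_files` indicates that a repeated run would start from the same input data; a non-empty result names the files that would break the ceteris paribus assumption.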

Procedure for a Computing Environment to Support Repeatable Scientific Big Data Experimentation
• A data management plan is a formal document that outlines how a research institution and program will handle data both during research and after the project is completed.
– The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins.
– This ensures that data are well-managed in the present and prepared for preservation in the future.
– Multiple United States government agencies now require submitted proposals to include a supplementary document labelled “Data Management Plan.” These supplementary documents describe how the proposal will conform to scientific policy on the dissemination and sharing of research results.
• FUSEnet is a data analytics cloud specializing in managing both data and computational processes for assessing technical knowledge and identifying emergent technologies and capabilities.
• FUSEnet was a government system hosted by ORNL that stored unclassified, copyright-protected scientific information and provided remote access for approved users to analyse the stored data within a cloud computing environment to satisfy the research objectives of the IARPA FUSE Program.
– A key tenet within FUSEnet was that data integrity and availability were maintained.
– An ORNL-developed “data diode” embedded within FUSEnet gateways allowed access to protected data but prevented data removal by users.
– As necessary, a mechanism for approved data export was built into the system architecture.
– Also by design, the activities and work products of individual user teams were segregated from each other in the cloud computing virtual environment.

FUSE Team

“Finding Patterns of Emergence – Foresight and Understanding from Scientific Exposition (FUSE),” D. Murdick, 9 Jan 2014


FUSEnet Environment and Capabilities
• Big Data analytics cloud hosted at ORNL
• Large-scale computational and storage infrastructure
• High-quality, commercial data: 120 million+ documents
– IEEE, Nature, Elsevier, Web of Science, Lexis-Nexis
– English and Chinese
• Security Diode to protect user workspaces and data
• Virtualized computing space through VMware; 200+ virtual machines
• Unclassified system with remote access for approved users

FUSEnet Environment and Capabilities
• The FUSEnet computing environment is based on the Cloud service model. These models are usually described by a three-layer classification called SPI, for SaaS, PaaS, and IaaS, as follows:
– SaaS – Software as a Service: applications that are available on-demand.
– PaaS – Platform as a Service: refers to a computing platform of software components and middleware that are used by end-users to develop and manage their cloud applications. Typically, cloud providers at this layer offer databases, web servers, development environments, and application monitoring tools.
– IaaS – Infrastructure as a Service: physical or virtual machines with access to data storage and other operating system services. The cloud user is typically expected to install and maintain operating-system images.

FUSEnet Capabilities (SaaS Level)
• At the SaaS level, four unique software applications perform automated technical assessments supporting the detection and forecasting of technical emergence. The four main applications developed within the FUSEnet system were:

Technical Emergence Software | Technical Summary Description: Performs the detection of technical emergences through …
ARBITER (BAE Systems) | … feature extraction using deep NLP and storing the features in an RDF database. Features are combined into higher-level time-based indicators that facilitate the detection and forecasting of technical emergence.
Copernicus (SRI International) | … the use of historical analysis of related papers and patents. Entities are discovered during ingestion and stored in MongoDB. Features are derived from entities and are aggregated into a set of time-based measurements for time-series analysis.
Emerge (BBN) | … topic modeling and clustering of the topics with the use of statistical models to perform emergence detection and prediction. Stores features in MongoDB.
DETAiLS (Columbia University) | … the generation of document shards and the employment of NLP, resulting in annotation. Analysis involves sentiment, network, and time-series.
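The applications above are described as aggregating extracted features into time-based indicators. The following minimal sketch shows one generic way such an indicator could be formed: counting documents per year that mention a term and computing year-over-year growth. It is a hypothetical illustration only and is not the method used by ARBITER, Copernicus, Emerge, or DETAiLS; the input format and function names are assumptions.

```python
from collections import Counter


def yearly_counts(documents, term):
    """Count documents per year whose text mentions `term` (case-insensitive).

    `documents` is an iterable of (year, text) pairs; this shape is assumed.
    """
    counts = Counter()
    needle = term.lower()
    for year, text in documents:
        if needle in text.lower():
            counts[year] += 1
    return dict(sorted(counts.items()))


def growth_rates(counts):
    """Relative change between consecutive years that have at least one mention.

    Years with zero mentions are absent from `counts` and are skipped.
    """
    years = sorted(counts)
    return {
        curr: (counts[curr] - counts[prev]) / counts[prev]
        for prev, curr in zip(years, years[1:])
    }


if __name__ == "__main__":
    docs = [
        (2010, "Graphene sensor prototype"),
        (2011, "graphene film deposition"),
        (2011, "Large-area graphene growth"),
        (2011, "silicon photonics"),
    ]
    counts = yearly_counts(docs, "graphene")
    print(counts, growth_rates(counts))
```

A sustained positive growth rate for a term is one crude signal that could feed a time-series analysis of emergence; real systems would combine many such indicators.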

FUSEnet (PaaS Level)
• The SaaS software applications use a variety of tools and libraries at the platform (PaaS) level.
• While the SaaS level in FUSEnet is the automated assessment, the FUSEnet PaaS computing platform can best be described as a Network Analysis and Text Analytics platform.
• Text analysis uses statistical pattern learning to find patterns and trends from text data (in our case, scientific literature and patents).
• A summary of several key tools that FUSEnet includes follows.

No. | FUSEnet PaaS Analytics Tool | Technical Usage | SaaS application that uses it
1 | MySQL | SQL database typically used to store document, term, and author data. | Emerge, ARBITER
2 | MongoDB | Document-oriented, NoSQL database used to store extracted entities and indicator-specific data. | Emerge, Copernicus
3 | MALLET | Machine Learning and NLP Toolkit for Java. Provides topic modelling for document clustering. | Emerge
4 | Sofia-ml | Fast incremental machine learning algorithm. Provides clustering of documents from topic models generated by MALLET. | Emerge
5 | Lucene IR system | Used for its indexing engine. | Emerge
6 | Scikit-learn | Machine learning models. | Emerge
7 | Tomcat/Solr Web Server | Used for term indexing. | ARBITER
8 | Apache ActiveMQ | Messaging and integration patterns. | ARBITER
9 | Cassandra | NoSQL database. | ARBITER
10 | Virtuoso | RDF triple storage. | ARBITER
11 | OpenRDF/Sesame | RDF processing including parsing, storing, reasoning and querying. | ARBITER
12 | Spring Framework | Used for integration using JMS. | ARBITER
13 | Lucene/Solr | Document-level information search, retrieval and storage engine. | ARBITER, DETAiLS
14 | Open NLP | Machine learning based toolkit for processing natural language text. | ARBITER
15 | Netica | Used for working with belief networks and influence diagrams. | ARBITER
16 | Elasticsearch | Extension on Lucene that provides search and analytics. | Copernicus
17 | Hadoop 2+ | Used for extract, transform, and load (ETL) and de-duplication processing. | Copernicus
18 | Berkeley Parser | Sorts and assigns words in sentences into subjects, verbs, and objects. | DETAiLS
19 | Duke | Deduplication engine written in Java operating with Lucene. | DETAiLS
20 | Stanford Chinese Word Segmenter | Splits Chinese text into a sequence of words. | DETAiLS
21 | Stanford Part-of-Speech (POS) Tagger | Reads text and assigns parts of speech to each word (noun, verb, adjective, etc.). | DETAiLS
22 | UIMA | Unstructured Information Management Architecture (UIMA) is a general framework for analysis of unstructured information and its integration with search technologies. | DETAiLS
23 | Weka | Machine learning software written in Java for data analysis and predictive modelling. | DETAiLS

[Figure: FUSEnet Platform as a Service (PaaS)]
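To make "statistical pattern learning" on text a little more concrete, the sketch below computes TF-IDF term weights in plain Python. TF-IDF is swapped in here as a simple, well-known stand-in; it is not the topic modelling performed by FUSEnet tools such as MALLET, and the whitespace tokenizer and sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    Tokenization is naive lowercase whitespace splitting (an assumption).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [
        "graphene sensor patent",
        "graphene battery",
        "graphene solar cell",
    ]
    for doc, w in zip(corpus, tf_idf(corpus)):
        print(doc, "->", max(w, key=w.get))
```

Terms that appear in every document (here, "graphene") receive a weight of zero, so the distinctive terms of each paper or patent stand out.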

FUSEnet (PaaS Level)
• A selection of software libraries for Social Network Analysis and text analysis in FUSEnet, available for ensuring that experiments can be repeated.

No. | Library/Package | Description | SaaS application that uses it
1 | Arpack | Linear algebra routines for Java | Emerge
2 | JDOM | XML processing library for Java | Emerge
3 | Jwnl | Java WordNet library | Emerge, ARBITER
4 | Matrix-toolkits-java | Linear algebra data structures for Java | Emerge
5 | BLAS | Linear algebra subroutines | Emerge
6 | LAPACK | Linear algebra data structures and subroutines | Emerge
7 | Libquadmath | High-precision math libraries | Emerge
8 | Beanshell | Scripting for Java | Emerge
9 | Trove4j | High-performance data structures for Java | Emerge
10 | JGrapht | Graphical data structures and algorithms for Java | Emerge
11 | JUNG | Java Universal Network/Graph Framework | ARBITER
12 | R | Development environment for statistical computing and graphics | ARBITER

FUSEnet (PaaS Level)
• The FUSE platform data sets include:
– bibliographic citations of journal articles (108+ million): Chinese (CNKI) and Thomson Reuters’ Web of Science (TR WoS),
– full-text journal articles (5+ million),
– patent records (14+ million at beginning of 2013 for the US [USPTO] and China [SIPO]), and
– update records (51+ million for the US and China).
• Growth estimated at ~35k unique docs/month for FUSE; worldwide ~800k docs/month
• Large increase in scientific journal articles and patent applications illustrated in the FUSE research system during the past two decades.
– The number of Chinese patent applications is increasing dramatically and has now surpassed the number of US patent applications.
– Also, the number of Chinese journal articles is increasing at a rate faster than the rest of the world.

FUSEnet Capabilities (IaaS Level)
• The FUSEnet computing environment is based on the Cloud service model. Current Specifications:
– 770 gigaFLOPS of maximum performance
– 16 blade servers (plus 2 support blades), each with 2 CPUs, each with 6 cores, totaling 192 cores (processors)
– 3.07 TB of RAM w/ 192 GB per node
– Disk space:
• EMC Isilon: 440 TB (5 Isilon nodes) running NFS over 10 Gb/s Ethernet
• HP LeftHand: 260 TB of effective disk storage used for data backup
• Isilon disk I/O is roughly 3-10x improvement over the LeftHand Storage
– Networking: Flex-10 modules totaling ≤160 Gbits/sec bandwidth per enclosure x 2 enclosures
– Virtualized computing space through VMware
• Access and control policies are enforced by ORNL
• Call Center and metrics for service quality
• 24/7 operational performance at 99.8% availability

[Figure: FUSEnet 2.0 Physical Network Diagram (10GE Switches 1-4, Isilon Chassis 1-3 on the Isilon internal network, LeftHand storage, C7000 Chassis 1-2, Ethernet iLO management, 2x 10GE uplink to ORNL)]
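The IaaS figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that "node" means one of the 16 blade servers, that TB is decimal (1 TB = 1000 GB), and that availability is measured over a 365-day year; these assumptions are not stated on the slide.

```python
# Figures taken from the slide.
BLADES = 16
CPUS_PER_BLADE = 2
CORES_PER_CPU = 6
RAM_PER_NODE_GB = 192
AVAILABILITY = 0.998

# Assumption: a 365-day year of 24/7 operation.
HOURS_PER_YEAR = 365 * 24

total_cores = BLADES * CPUS_PER_BLADE * CORES_PER_CPU
# Assumption: decimal units, 1 TB = 1000 GB, and one "node" per blade.
total_ram_tb = BLADES * RAM_PER_NODE_GB / 1000
allowed_downtime_hours = (1 - AVAILABILITY) * HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"cores: {total_cores}")
    print(f"RAM: {total_ram_tb:.2f} TB")
    print(f"downtime at 99.8%: {allowed_downtime_hours:.2f} h/yr")
```

Under these assumptions the totals agree with the slide (192 cores and about 3.07 TB of RAM), and 99.8% availability corresponds to roughly 17.5 hours of downtime per year.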

Analysis of Technical Requirements and Alternatives versus Commercial Cloud Providers

• Representative current cloud solution offerings from several commercial vendors:

1. Amazon Web Services
– Cloud offering overview: Overall market leader offering virtual servers, MapReduce (Hadoop) for search engine, large data storage, SQL databases, NoSQL databases, mobile integration, business applications including email, payment systems, and workflow.
– Applicability to FUSEnet: PaaS (databases), IaaS

2. Google Cloud Platform
– Cloud offering overview: App Engine web application platform (PaaS), virtual machines, file storage, SQL databases, NoSQL, big dataset support, mobile integration.
– Applicability to FUSEnet: PaaS (databases, web apps), IaaS

3. IBM SmartCloud
– Cloud offering overview: SaaS including data warehousing and analytics, business analytics engine, business process management, financial modelling, payment systems, medical analysis, social media analysis, transportation management, medical analytics, SQL databases, NoSQL databases, mobile integration.
– Applicability to FUSEnet: SaaS (social media analysis), PaaS (databases, web apps), IaaS

4. Microsoft Azure
– Cloud offering overview: Windows or Linux virtual machines, messaging, scheduling, SQL databases, NoSQL databases, mobile integration.
– Applicability to FUSEnet: PaaS (databases), IaaS

5. Rackspace Cloud
– Cloud offering overview: High bandwidth networking, virtual machines, data storage, process load balancing.
– Applicability to FUSEnet: IaaS

• Considering the data management and experimentation requirements and the strategic issues, the question arises: Are the IaaS and PaaS from these selected vendors sufficient for hosting and maintaining the FUSEnet SaaS and PaaS?
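As a rough illustration of how the "Applicability to FUSEnet" column can be queried, the sketch below records each vendor's applicable service layers and filters vendors by the layers a deployment would need. The vendor-to-layer mapping assumes the table entries pair with the vendors in the order listed; the dictionary and function names are hypothetical. The filter only reports which layers are on offer and does not by itself settle whether they are sufficient.

```python
# Minimal sketch: the applicability column transcribed as sets of SPI layers.
# Assumption: entries are matched to vendors in the order they appear.
APPLICABILITY = {
    "Amazon Web Services": {"PaaS", "IaaS"},
    "Google Cloud Platform": {"PaaS", "IaaS"},
    "IBM SmartCloud": {"SaaS", "PaaS", "IaaS"},
    "Microsoft Azure": {"PaaS", "IaaS"},
    "Rackspace Cloud": {"IaaS"},
}


def vendors_offering(required_layers):
    """Return, sorted by name, the vendors whose layers cover every required layer."""
    required = set(required_layers)
    return sorted(
        vendor for vendor, layers in APPLICABILITY.items() if required <= layers
    )


print(vendors_offering({"IaaS", "PaaS"}))
print(vendors_offering({"SaaS"}))
```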

Analysis of Technical Requirements and Alternatives versus Commercial Cloud Providers

• Analysis of SaaS Technical Alternative
– FUSEnet consists of four unique technical emergence software applications.
– Current cloud providers are not in the business of providing this niche capability.
– Cloud providers offer more general SaaS services such as:
• Enterprise Resource Planning (ERP),
• general accounting,
• medical, and
• financial applications for managing business administration operations.
– If FUSEnet were to be employed on a 3rd party cloud,
• unique, domain-specific expertise would be required to operate and manage the FUSEnet software applications.


Analysis of Technical Requirements and Alternatives versus Commercial Cloud Providers (in the context of PaaS)

• FUSEnet consists of several framework and middleware solutions combined with math-based libraries that are unique to network and text analysis.
• With the exception of IBM SmartCloud, current cloud providers are not in the business of exclusively providing this niche capability.
– The features of the network and social analytics tools in SmartCloud should be further evaluated.
• Cloud providers offer more general PaaS software such as
– databases, email, and web servers.


Analysis of Technical Requirements and Alternatives versus Commercial Cloud Providers (in the context of IaaS)

Strategic Issue 1: Cloud Scope – what is the design to meet the need?
– Description: Identifies the availability, performance, and security needs; sufficient and planned computing power, storage, and bandwidth.
– Assessment for FUSEnet: FUSEnet is monitored daily and reported monthly with the current operational stats: availability 99.8%; CPU usage 12-18%; memory usage 56-65%; storage usage 69%. FUSEnet is installed with a Data Diode that protects against data exfiltration of its repository. FUSEnet is a virtual environment with separated computing enclaves. Each user or user group within an enclave has the freedom to compose and perform their needed computational research without directly impacting other users.

Strategic Issue 2: Service Levels
– Description: Identifies the expected workload, admin support, service delivery needs, timing and I/O response.
– Assessment for FUSEnet: FUSEnet Test and Evaluation (T&E) simulates heavy end-user loading. This is measured to be an increase of 5-10% of the daily load. For its initial usage, FUSEnet could simultaneously host 3-4 heavy end-users loading. The Admin support is at two levels: operating system and the virtual layer through VMware.

Strategic Issue 3: Deployment Needs
– Description: Identifies the integration needs with infrastructure services.
– Assessment for FUSEnet: FUSEnet operates on VMware that isolates the PaaS from dependencies on the hardware and the Operating System. The current FUSEnet system, including the number of cores, performance of the cores, memory, and the Isilon storage, is a proven baseline for simultaneously hosting 3-4 heavy end-user loading.

In general, commercial, academic, and government entities are advised to consider strategic issues with regard to cloud scope, service levels, and deployment needs.


Analysis of Technical Requirements and Alternatives versus Commercial Cloud Providers (in the context of IaaS)

• FUSEnet was operated in a secured cloud environment at the Data Computing Center.
– The FUSEnet hardware was performance tested to determine its disk I/O (input/output).
– Software programs ran these tests at two levels: a low-level ('raw') set of read and write tests, and application-layer tests that simulated application disk usage.
– From these initial test results and further repeated testing, the FUSEnet disk I/O was optimized for handling the volume and type of data used in the system.
– Further tests compared FUSEnet with a commercial cloud offering; FUSEnet showed similar or better performance, depending on the operating conditions selected.
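The slides do not name the programs used for the disk I/O tests. As a minimal sketch of what a low-level sequential read/write measurement can look like, the hypothetical function below writes a temporary file in fixed-size blocks, reads it back, and reports throughput. It is a file-level approximation rather than true raw-device I/O, and the read pass may be served from the operating system's page cache, so the numbers are only indicative.

```python
# Minimal sketch of a sequential write/read throughput test (file-level, not raw device).
# All names and default sizes are illustrative assumptions.
import os
import tempfile
import time


def sequential_io_throughput(size_mb=64, block_kb=1024, directory=None):
    """Write then read a temporary file; return bytes read and MB/s for each pass."""
    block_bytes = block_kb * 1024
    block = os.urandom(block_bytes)          # incompressible test data
    n_blocks = (size_mb * 1024) // block_kb  # number of blocks to write

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())             # force data to storage before stopping the clock
        write_s = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        total = 0
        with open(path, "rb") as f:
            while chunk := f.read(block_bytes):
                total += len(chunk)          # may be served from the page cache
        read_s = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)

    megabytes = total / (1024 * 1024)
    return {
        "bytes": total,
        "write_mb_s": megabytes / write_s,
        "read_mb_s": megabytes / read_s,
    }


result = sequential_io_throughput(size_mb=8, block_kb=256)
print(f"write {result['write_mb_s']:.1f} MB/s, read {result['read_mb_s']:.1f} MB/s")
```

Pointing `directory` at a mount backed by the storage under test (for example an NFS share) and repeating the run with different block sizes gives a crude comparison between storage systems; an application-layer test would instead replay the access pattern of the actual analytics workload.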

• The FUSEnet software and data can be operated on 3rd party (IaaS) environments that meet the overall system requirements, as follows:
– Handle big data that is a mix of structured and unstructured data and is continuously growing.
– Protect selected data and apps (commercial, proprietary) that remain in the cloud.
– Rapidly deploy software solutions to the data and rapidly ingest data into the system.
– Provide virtualization for operating systems including common Linux distributions, Windows and Mac OS.
– Provide the computing performance required by big data analytics software services.
– Provide an easy-to-use big data analytics platform.
– Provide high-performance big data storage and retrieval up to 500 TB, with the ability to continue to scale.
– Provide robust, state-of-the-practice cyber security.


Conclusions and Future Directions

• Impact of Big Data was identified
– The Changing Nature of our World and Research
– Data Explosion
– Business of Big Data

• Computational Intelligence Aspects of Cyber Security were articulated

• A Computing Environment in Cyberspace to Support Repeatable Scientific Experimentation was proposed.


Conclusions and Future Directions (cont.)

• This computational capability ensures the integrity, availability and confidentiality of new technologies and new technical knowledge.
• This will position scientific investigators (academic, commercial, and government) with an advantage to address the technical and political challenges all three entities face.
• FUSEnet offers this unique capability:
– a computing environment necessary to support repeatable experimentation,
– which can be housed at an appropriately configured Data Center,
– in order to provide value to investigators from a variety of sources while adhering to recently mandated Data Management Planning.

Closing thought: if we as a research community are to make a significant impact, we must adhere to the tenets of the scientific method.


Robert (Bob) K. Abercrombie, Ph.D.
Director, Cybernomics Laboratory
Department of Computer Science
Center for Information Assurance
FedEx Institute of Technology
University of Memphis
Memphis, TN 38152
Email: [email protected]
