Unstructured Data Workflow: A Functional Overview and Security Issues

Fadoua Khennou
TTI Laboratory, Higher School of Technology
Sidi Mohamed Ben Abdellah University, Fes-Morocco
[email protected]

Youness Idrissi Khamlichi
REIS Laboratory, Faculty of Science and Technology
Sidi Mohamed Ben Abdellah University, Fes-Morocco
[email protected]

Chaoui Mejhed Nour El Houda
TTI Laboratory, Higher School of Technology
Sidi Mohamed Ben Abdellah University, Fes-Morocco
[email protected]


Abstract—Nowadays the amount of data generated every second is huge, and dealing with it has become more challenging than ever. One of the important challenges encountered in big data development is the manipulation of unstructured data: as Merrill Lynch reported, 80% of today's data are unstructured while only 20% are structured. In addition, unstructured data provide a rich source of information about people, households and economies, and while these data are present in abundance, we need effective methods and tools to process them and extract the meaningful information. Our main purpose in this paper is to define in depth the main issues encountered when dealing with this source of data, and to shed light on some of the existing approaches implemented from the data acquisition to the preprocessing layer, while providing a functional view of the unstructured data workflow.
Index Terms—Big Data, unstructured data, acquisition, preprocessing.

I. INTRODUCTION
The term big data refers to the emergence of important technologies arising from several domains (health, finance, statistics, embedded systems, ...). Since these data can be structured or unstructured, their processing is all the more complicated: it requires a well-studied workflow to reach a mature system that responds to several constraints. Thomas Davenport wrote in his book: “More than the amount of data itself, the unstructured data from the web and sensors is a much more salient feature of what is being called big data”, that is to say, good management of unstructured data is crucial in order to reduce storage and compliance costs. This people-oriented content takes many forms and it is often ignored or forgotten by many organizations in favor of structured data, due to its massive volume and lack of schema. In this paper we first define the benefits of NoSQL databases along with the use of unstructured data; in section IV we present a functional view of the unstructured data workflow while addressing some security issues encountered in each layer.

II. RELATED WORK
This section reviews recent progress and efforts in unstructured data management. There are many surveys and states of the art that study the challenges encountered in big data development [1] [2]; they focus on addressing challenges and technologies, whereas in this paper we try to present our view of a functional workflow, common to any big data framework, regarding unstructured data. The main reason for shedding light on unstructured data is the huge amount of data being generated every second from multiple heterogeneous sources; in order to extract meaningful information from this kind of data with analytic approaches, we first need to go through its storage and processing via the presented layers. In this context, the researchers in [3] presented a very general state of the art of the challenges encountered in each layer for all types of data. Given the number of challenges faced in this big data era, the authors of [4] focused their work on presenting a careful balance between the threats and opportunities, while taking into account the privacy of the user.

III. UNSTRUCTURED DATA MANAGEMENT
A. NoSQL databases
The exponential growth of digital information made possible the development of several tools and solutions to manage our data effectively in terms of processing, storage and analysis techniques. Yet traditional tools do not scale to the massive amount of information that must be processed: they are generally bound to structured rules and schemas and always require some atomicity and consistency of the stored data. NoSQL (Not Only SQL) was created to allow a more open approach to the management of such databases. Instead of using structured tables and storing multiple attributes in columns, NoSQL databases use the key/value concept.


Simply put, there is no database schema. The database stores the values for each provided key, distributes them across the database and allows their efficient retrieval. The absence of a schema prevents the use of complex queries and of NoSQL as a transactional database environment. NoSQL databases can be divided into four categories:
1) Key/value: each key points to a value that is stored in the database for later retrieval.
2) Documents: a key points to a set of fields and values, stored in a hierarchical manner. This type assumes that the stored values are structured, self-descriptive documents (XML, JSON or other) that can be examined. An e-commerce website that needs a flexible schema to store the description of its products could use this type of solution.
3) Columns: a key points to a set of columns, each with a value. They are structured in rows and columns as in an RDBMS; the records are, however, assembled in groups of columns, and aggregation operations are very efficient.
4) Graphs: data are modeled as nodes that are linked together, allowing the storage of entities connected by directional associations with properties.
B. NoSQL applications and unstructured data
With a relational database, the only option is to store data using the same configuration as the tables, and all the data used must meet all the requirements. That does not leave much room for dynamic information. NoSQL databases are generally able to retrieve a large set of data more efficiently than relational databases. An example of their use was studied in [5] to solve performance issues related to the management of data from geographic information systems (GIS): the researchers analyzed the characteristics of NoSQL databases in the context of massive GIS data and offered an effective approach to store these data in a MongoDB (document-oriented) database using a Python script. A state of the art and a comparative study were also performed in [6] to give a general idea of the different tools exploiting NoSQL databases. The architecture of NoSQL databases is a major advantage when it comes to redundancy and scalability. In addition, NoSQL helps with the management of unstructured data, which can be quite a complex task when it comes to a large mass of data. Unstructured data are data that do not follow a specific format conforming to a predefined template. They can be textual, such as Word documents, PowerPoint presentations, instant messaging, collaboration software, documents, books, social media posts and medical records, or non-textual, such as media files: MP3 audio, JPEG images and Flash video. In a business that generates a large volume of data, they can represent a high percentage compared to structured data; for this reason their management is becoming more critical and we need new software and tools that are better suited to their management.
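To make the document-store idea concrete, the following minimal sketch (our own illustration, not the implementation of [5]) stores two differently shaped product descriptions in the same MongoDB collection; it assumes a local MongoDB server and the pymongo driver, and the database and collection names are purely illustrative:

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumption: server running on the default port).
client = MongoClient("mongodb://localhost:27017/")
collection = client["demo_db"]["products"]

# Documents with different fields can live in the same collection: no schema is imposed.
collection.insert_one({"_id": 1, "name": "camera", "price": 199.0,
                       "specs": {"resolution": "20MP", "weight_g": 350}})
collection.insert_one({"_id": 2, "name": "ebook", "price": 9.9,
                       "author": "J. Doe", "formats": ["epub", "pdf"]})

# Retrieval by key remains simple and efficient despite the heterogeneous structure.
print(collection.find_one({"_id": 2}))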

Unstructured data can be generated from various sources:
a) Satellite images: meteorological data and data from supervision satellites, such as Google Earth.
b) Scientific data: data from the high-energy sciences (seismic, atmospheric data, ...).
c) Photographs and videos: digital surveillance data and video files.
d) Data from sensors: data from sensors typically used in vehicles, weather stations, or others.
e) Social media: data generated from social media platforms such as YouTube, Facebook, Twitter and LinkedIn.
f) Mobile data: data from mobile devices, such as text messages and location data.
g) Web content: content from any website whose content is unstructured, such as YouTube, Instagram and others.
In the era of big data, the major issue that arises is primarily related to the management of unstructured data. Since these data have a heterogeneous format that makes the data mining step more complicated, many researchers have tried to suggest solutions that enable effective management of data while preserving system performance. In [7] this problem was decomposed into three complementary parts: the heterogeneity of data, non-mapping data, and the need to deploy different APIs. As previously described, the concept of NoSQL exists today to respond to such problems; the main concern that remains to be studied is the analysis of document-oriented databases, and several studies [8] [9] have proposed approaches for this type of storage. Note that there are several NoSQL projects and applications:
Cassandra: a NoSQL database based on a key-value system, where the value is stored in a column.
CouchDB: a NoSQL database based on a key-value system where the value is stored in JSON documents. It uses a combination of HTTP, JavaScript and Map/Reduce for querying.
Hadoop & HBase: a complete ecosystem of integrated distributed computing tools, which uses a file system (HDFS) and a programming framework (Map/Reduce).
MongoDB: a NoSQL database based on a key-value system where the value is a JSON document; it has its own query language.
There are many NoSQL databases, but each one differs from another in its specifications and use cases. Table I summarizes a comparative study of the main NoSQL databases [10]:



TABLE I. COMPARATIVE STUDY OF NOSQL DATABASES

NoSQL engine | Type of data   | Characteristics          | Use case
HBase        | Columns        | Big Data                 | Excellent performance on large volumes of data.
Cassandra    | Columns        | Big Data                 | Linear scalability on large volumes of data.
Cassandra    | Columns        | Big Data                 | Linear scalability on large volumes of data.
CouchDB      | JSON documents | Functionalities          | Web development without the need for excellent performance.
MongoDB      | JSON documents | Sharding and replication | Easier scalability on semi-structured data.
Riak         | Key-value      | Decentralized            | Excellent performance and very easy scalability by adding nodes.
Redis        | Key-value      | Performance              | Sits between a cache and a database system; provides excellent performance on simple data processing.

When choosing a database engine we have to consider not only the technology but also the needs of the organization vis-a-vis the amount and the type of data. Whenever possible, this choice must also be based on internal testing, taking into account the constraints of the business. It is a decisive choice, because it is difficult to go back once all of your data is managed in one system. Everything therefore depends on whether the business needs a whole ecosystem that manages the entire workflow or only a part of it.

IV. A FUNCTIONAL VIEW OF UNSTRUCTURED DATA WORKFLOW

In the following scheme we present the unstructured data workflow and address the characteristics of each layer:

Fig. 1. Big data workflow (Collection: web crawlers, log files (Scribe, Flume, Chukwa, ...), sensors; Transmission: network interconnection, IP backbone, data center network, TCP/UDP network protocols; Preprocessing: cleansing, compression, integration, discretisation, transformation)

A. Acquisition

The acquisition of data is the preliminary step of this workflow; it covers all the distributed sources that generate the various types of data (video, images, sensor readings, ...). The main purpose of this step is to gather a high volume of data from heterogeneous sources. The Internet is the ultimate and most exploited data source: data from social networks, comments, forums, groups and others can be generated and used thereafter. This step can be performed either longitudinally, that is to say by generating linked data from the same source, or in parallel in a distributed way. This layer has also been studied in [1] [11] and was defined through three sub-layers: collection, transmission and data preprocessing.
1) Logs collection: logs are stored in files that list events such as visited pages, button clicks, connections and exceptions. Some of these log lines can be structured, whereas the rest may be completely unstructured, containing whatever information the developers of the application choose to include. There are several data collection tools for logs:
Apache Flume [12]: a distributed and reliable data collection service; it allows the aggregation and transmission of large amounts of event data flows.
Scribe from Facebook: a server for aggregating log data in real time from a large number of servers. It is designed to be scalable, extensible and robust. It was developed by Facebook and released as open source in 2008.
Chukwa from Hadoop [13]: an open source data collection system for the monitoring of large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce and inherits the scalability and robustness of Hadoop. It also includes a flexible and powerful toolkit to display, monitor and analyze results for better use of the data.
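As an illustration of how a log line can mix structured fields with an unstructured remainder, the following short sketch (our own, standard-library Python; the web-server-style log format shown is only an assumption) splits one line with a regular expression:

import re

# A typical access-log line: structured prefix (IP, timestamp, request, status, size)
# followed by free text that the application developers chose to append.
line = '192.168.1.7 - - [10/Oct/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 user clicked promo banner'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)(?P<free_text>.*)'
)

match = pattern.match(line)
if match:
    record = match.groupdict()
    record["free_text"] = record["free_text"].strip()  # unstructured part, kept as-is
    print(record)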

2) Unstructured data collection: data such as emails, social network statuses, documents, images and videos are considered unstructured data [2]. Emails and binary file formats have a header defined using metadata; however, the content is completely unstructured, and can appear in the form of free text or binary bits and bytes. Collecting these data helps us understand their exact structure; a framework [14] was developed in this context which includes techniques and methods for deploying flows and treatments for scientific data stored in disparate formats. Biometrics is another area that uses unstructured data, specifically images: fingerprints and facial images are processed to extract structured attributes, for example fingerprints are transformed into lines and polygons, and the biometric comparison is then performed using structured attributes rather than raw data. In addition, there are several methods and tools to collect unstructured data, such as web crawlers or spiders [3] [15], which aim to bring together web pages from multiple web servers connected to a network. The result is a unified file that can be used later to retrieve the desired information, as shown in the sketch below.
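The sketch below is our own illustration of the crawling idea, not the architecture of [15]; it assumes the third-party requests package and a reachable seed URL (the URL shown is only a placeholder), and simply follows hyperlinks breadth-first up to a fixed page budget, concatenating the fetched pages into one file:

import re
from collections import deque

import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(seed_url, max_pages=10, out_path="pages.txt"):
    """Breadth-first crawl that appends every fetched page to a single file."""
    frontier, seen = deque([seed_url]), set()
    with open(out_path, "w", encoding="utf-8") as out:
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue  # skip unreachable pages
            out.write(f"==== {url} ====\n{html}\n")
            frontier.extend(LINK_RE.findall(html))  # enqueue discovered links
    return seen

# Example usage (placeholder seed URL):
# crawl("https://example.org", max_pages=5)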

B. Transmission
When acquiring massive data, we need an effective transmission mechanism for sending the data to an appropriate storage management system that can support various analytical applications [16].


The need for security measures in this layer is very prominent. In order to ensure that sensitive data are well protected when data are generated for big data management, an encryption algorithm is crucial: nobody should then be able to collect the information and affect the integrity or the confidentiality of the data. Another persisting problem concerns processing data in cloud computing; a solution is to use a fully homomorphic encryption scheme [17] in order to perform computations over encrypted data in a distributed framework. We present the data transmission phases in the following scheme:

Fig. 2. Data transmission phases (data sources -> encrypt -> IP backbone -> encrypt -> data center)
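As a simple illustration of encrypting records before they leave the data source, the following sketch uses symmetric encryption from the third-party cryptography package; it is only an illustrative stand-in, not the fully homomorphic scheme of [17], and the record content is made up:

import json
from cryptography.fernet import Fernet

# In practice the key would be provisioned securely to both endpoints;
# here we simply generate one for the demonstration.
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"sensor_id": "s-042", "reading": 21.7, "note": "free-text comment"}

# Encrypt before the record crosses the IP backbone ...
token = cipher.encrypt(json.dumps(record).encode("utf-8"))

# ... and decrypt only inside the data center.
restored = json.loads(cipher.decrypt(token).decode("utf-8"))
assert restored == record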


a) IP backbone: to ensure proper routing of data under the right conditions, we need interconnected networks that deploy an optimal infrastructure. This network is concerned not only with the delivery of data but also with every routing segment over the life cycle of big data, and for each layer of the network the requirements for the transmission of data must be satisfied. The IP backbone is an intermediate layer connecting the data sources to the data center networks; it has to assure high-speed transmission with low latency.
b) Data center transmission: in a data center, the data is processed and analyzed with distributed computational tools such as MapReduce. Transmitting data across the different nodes of these servers requires a well deployed interconnection to ensure the reliability and availability of information. Xiaomeng Y. et al. [16] performed a comparison of the different approaches used during transmission from an IP backbone to a data center; they identified the main challenges of big data applications by defining a state of the art in order to address the problem of network transmission for big data.
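To fix ideas about the MapReduce processing style mentioned above, here is a tiny standard-library Python sketch (our own, purely illustrative and not tied to Hadoop) that counts words over a few text fragments through explicit map, shuffle and reduce steps:

from collections import defaultdict

fragments = ["big data needs preprocessing",
             "unstructured data needs effective tools",
             "big data tools"]

# Map: emit (word, 1) pairs from every fragment.
mapped = [(word, 1) for text in fragments for word in text.split()]

# Shuffle: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}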


Fig. 3. Functional view of unstructured data workflow (Phase 1: generation; Phase 2: acquisition; Phase 3: preprocessing and storage; Phase 4: processing and information extraction; Phase 5: analysis and exploitation)

In this paper our work is focused on the first three layers, which describe the generation, acquisition and preprocessing patterns; our main goal is to define the issues encountered in each layer.

C. Preprocessing
The collected data can sometimes include redundant or unnecessary data, which needlessly increases storage space and subsequently affects the analysis time. That is why a preprocessing layer is crucial, in order to perform the compression and filtering functions that are essential to ensure effective data storage. The main tasks that must be performed during the data preprocessing stage are the following:
Data cleaning: resolve inconsistencies and identify missing and noisy values.
Data integration: combine data from multiple sources into one coherent database.
Transformation: allow standardization and aggregation.
Compression: provide a reduced representation in volume while guaranteeing similar analytical results.
Quantification of data: reduce the size of certain types of data.
A short illustrative sketch of these tasks is given at the end of this subsection. When dealing with big data we also have to consider issues other than the management of the data itself: security is a major concern that must be addressed in order to assure privacy and a high level of protection of the data. There are many considerations to take into account, and after the preprocessing layer we should consider the challenge of monitoring data in real time. This is related to the generation of alerts to determine who did what, whether the system is under attack, and who accesses which data and when. Indeed, since we are dealing with massive stored data, there must be a daily security audit, implemented over a distributed infrastructure. Real-time security management will enable companies and organizations to react immediately to attacks; it also helps strengthen the cyber defense of a company that manages a large mass of data.
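The following standard-library Python sketch is our own illustration of the cleaning, integration, transformation and compression tasks listed above, applied to a handful of made-up records; the field names and values are assumptions for the example only:

import gzip
import json

source_a = [{"id": 1, "temp": "21.5"}, {"id": 2, "temp": None}, {"id": 1, "temp": "21.5"}]
source_b = [{"id": 3, "temp": "19.0"}]

# Integration: combine records from multiple sources into one collection.
records = source_a + source_b

# Cleaning: drop duplicates and records with missing values.
seen, cleaned = set(), []
for rec in records:
    if rec["temp"] is None or rec["id"] in seen:
        continue
    seen.add(rec["id"])
    cleaned.append(rec)

# Transformation: standardize types and aggregate.
for rec in cleaned:
    rec["temp"] = float(rec["temp"])
average_temp = sum(r["temp"] for r in cleaned) / len(cleaned)

# Compression: store a reduced representation of the cleaned data.
blob = gzip.compress(json.dumps(cleaned).encode("utf-8"))
print(len(blob), "compressed bytes, average temperature:", average_temp)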

V. CONCLUSION
In this paper we have presented the main issues encountered when dealing with an unstructured data infrastructure; our main contribution is the presentation of a functional view of the unstructured data workflow. Dealing with unstructured data poses a major problem for researchers nowadays: since these data are derived from heterogeneous sources, their use by data mining tools becomes very critical, which requires effective methods to exploit the data in the best way. As future work, the study of the processing layer is crucial; we intend to use the Hadoop ecosystem in order to study in depth the issues encountered in node transmission and to merge it with a security layer that assures both privacy and optimization of the processing.

REFERENCES
[1] L. Zaiying, Y. Ping and Z. Lixiao, "A Sketch of Big Data Technologies," in 2013 Seventh International Conference on Internet Computing for Engineering and Science, Shanghai Sanda University, 2013.
[2] M. H. Padgavankar and S. R. Gupta, "Big Data Storage and Challenges," International Journal of Computer Science and Information Technologies (IJCSIT), vol. 5, no. 2, pp. 2218-2223, 2014.
[3] H. Han, W. Yonggang and C. Tat-Seng, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial," IEEE Access, 2014.
[4] C. Richard and C. Peter, "Is “Big Data” creepy?," Computer Law & Security Review, vol. 29, pp. 601-609, 2013.
[5] Z. Xiaomin, S. Wei and L. Liming, "An Implementation Approach to Store GIS Spatial Data on NoSQL Database," in Geoinformatics (GeoInformatics), 2014 22nd International Conference on, Kaohsiung, 2014.
[6] N. G. Venkat, R. Dhana and V. Vijay, "NoSQL Systems for Big Data Management," in 2014 IEEE 10th World Congress on Services, 2014.
[7] K. Richard and D. Ralph, "Unstructured Data Extraction in Distributed NoSQL," in Digital Ecosystems and Technologies (DEST), 2013 7th IEEE International Conference on, Menlo Park, CA, 2013.
[8] K. L. Richard and D. Ralph, "Data Mining from NoSQL Document-Append Style Storages," in Web Services (ICWS), 2014 IEEE International Conference on, Anchorage, AK, 2014.
[9] K. L. Richard and D. Ralph, "Topics and Terms Mining in Unstructured Data Stores," in Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on, Sydney, NSW, 2013.
[10] B. Rudi, Les bases de données NoSQL: Comprendre et mettre en œuvre, France: Eyrolles, 2013, p. 300.
[11] K. Ohbyung, L. Namyeon and S. Bongsik, "Data quality management, data usage experience and acquisition intention of big data analytics," International Journal of Information Management, 2014.
[12] Apache Flume, The Apache Software Foundation. [Online]. Available: https://flume.apache.org/.
[13] Apache Chukwa, The Apache Software Foundation. [Online]. Available: https://chukwa.apache.org.
[14] K. Verena, "A Holistic Framework for Big Scientific Data Management," in IEEE International Congress on Big Data, 2014.
[15] F. Rafael, F. Fred, B. Patrick, M. Jean, L. Rinaldo and C. Evandro, "RetriBlog: An architecture-centered framework for developing blog crawlers," Expert Systems with Applications, March 2013.
[16] Y. Xiaomeng, L. Fangming, L. Jiangchuan and J. Hai, "Building a Network Highway for Big Data: Architecture and Challenges," July 2014.
[17] G. Craig, "Fully homomorphic encryption using ideal lattices," in Symposium on the Theory of Computing (STOC), 2009.
