A BIG DATA PROCESSING FRAMEWORK BASED ON MAPREDUCE WITH APPLICATION TO INTERNET OF THINGS

1 Heba, A., *2 Mohammed, E., 3 Shereif, B.

1 Information Systems Dept., Faculty of Computers and Information, Mansoura University, Mansoura, Egypt, [email protected]
*2 Information Technology Dept., Faculty of Computers and Information, Mansoura University, Mansoura, Egypt, [email protected]
3 Information Systems Dept., Faculty of Computers and Information, Mansoura University, Mansoura, Egypt, [email protected]

ABSTRACT

Massive and varied data from the Internet of Things (IoT) create enormous storage challenges, and IoT applications have developed extensively. Over the past two decades, the expansion of computational resources has had a significant effect on the flow of data. This vast flow of data is known as "big data": data that cannot be managed using current conventional techniques or tools. Handled correctly, it yields interesting information, such as insight into user behavior, and supports business intelligence. In this paper, the proposed system is implemented to handle massive data from all forms of data resources, whether structured, semi-structured, or unstructured. In the data preprocessing stage, we use the K-nearest neighbors (KNN) technique to clean noisy data and Singular Value Decomposition (SVD) to reduce data dimensionality. In the processing stage, we propose a hybrid of Fuzzy C-means and density-based spatial clustering (FCM-DBSCAN) to deal with applications with noise. The clustering technique is implemented on the MapReduce model, the most widely adopted framework for processing big data. The proposed technique provides scalability, speed, and well-fitting accuracy for storing big data, and it obtains meaningful information from huge datasets, giving great vision for effective outcomes on a fast and efficient processing platform. The results and discussion show that the proposed system is a feasible solution for big data IoT-based smart applications: experimental results show that the accuracy of the proposed framework is 98.9% on the IADL activities dataset.

KEYWORDS: Internet of Things (IoT); Big data; Singular Value Decomposition (SVD); FCM-DBSCAN; MapReduce.

1. INTRODUCTION

The IoT is the connection that joins objects to the Internet through a variety of information-sensing devices. Every object that can be addressed individually can therefore exchange information with the others, ultimately realizing the aims of recognition, location, tracking, supervision, and administration [1].


Figure 1. The Big data in IoT.

The essential thought of the IoT is to connect all things in the world to the Web. It is expected that things can be recognized automatically, can communicate with each other, and can even make decisions without human interference [2]. Figure 1 shows the relationship between the IoT and big data and how sensor data are represented as big data. Data are among the most important parts of the IoT. By the nature of the IoT, data are gathered from various types of sensors and represent billions of objects. All things considered, data on the IoT present the following challenges:

- The massive scale of the IoT: It includes a huge number of perception devices. These devices consistently and automatically gather data, which leads to a rapid growth in data scale.
- Diversity of observation devices: The varied sources these devices inspect account for the heterogeneity of IoT data. Data gathered from different devices and measures have different semantics and structures.
- Interoperability: Most IoT applications are currently isolated. In the long run, the IoT will need to achieve data sharing to encourage collaboration among diverse applications. Taking a telemedicine service as an instance, once a patient is in crisis, traffic data are also needed to estimate the arrival


time of the ambulance and to choose what sort of auxiliary medical strategy to take.
- Multi-dimensionality: This is a principal issue in IoT applications. Most applications incorporate several sensors to simultaneously monitor various indicators, such as temperature, humidity, light, and pressure. Consequently, the sample data are typically multidimensional [1].

Data that are extensive in volume, asserted in a mixed variety, or moved with such speed are called "big data." Big data is not a thing; it is a concept or paradigm that characterizes the expansion, gathering, and utilization of huge measures of dissimilar information. Big data helps in decision making and takes business to a different universe [3]. Big data came into view when standard database systems were not prepared to handle unstructured data, such as weblogs, videos, photographs, social updates, and human behavior, produced by online social networks, sensor devices, or other data-generating sources.

Figure 2. The Big data 4Vs and data sequence.

Figure 2 shows the big data 4Vs, namely volume, velocity, variety, and veracity, and also describes the big data sequence. Several issues and technologies are associated with the availability of extremely large volumes of data that organizations need to join and exploit. A significant investment of time, money, and assets is required to make this style of processing routine.


The rest of this paper is structured as follows. Section 2 presents some basic concepts. Section 3 reviews the related work. Section 4 describes the proposed system and explains each phase in detail. In Section 5, the implementation results of the proposed techniques are discussed on a benchmark dataset. Finally, the conclusion and future work are presented in Section 6.

2. BASIC CONCEPTS

2.1 MAPREDUCE

The big data analytics community has adopted MapReduce as a programming model for handling massive data on distributed systems. MapReduce has become one of the preferred programming paradigms for processing massive datasets: it is a model for developing distributed solutions to complex problems over enormous datasets [4]. Users specify a map function that processes key-value pairs to produce a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The MapReduce architecture is shown in Figure 3.

Figure 3. The MapReduce architecture.

MapReduce Algorithm

There are four steps to implementing the MapReduce framework: reading a large dataset, applying the map function, applying the reduce function, and returning the resulting data from the map and reduce stages. The mapper receives chunks of data and produces intermediate results; the reducer reads the intermediate results and emits a final result.
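To make the four steps concrete, the following is a minimal sketch of the driver code in MATLAB (the platform used for the experiments in Section 5); the file name and function names are illustrative, not taken from the paper:

    % 1. Read a large dataset into a datastore.
    ds = datastore('activities.csv');
    % 2-3. Run the map and reduce functions over the datastore.
    outds = mapreduce(ds, @mapFun, @reduceFun);
    % 4. Return the resulting data from the map and reduce stages.
    result = readall(outds);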


A. Read Large Dataset

[Figure 4 flow: insert the large dataset → initialize a datastore variable to store the large dataset → select specific variables' names from the dataset → preview the large dataset → add the selected variables to the datastore.]

Figure 4. The block diagram of MapReduce read data.

As shown in figure 4, we create a datastore using the dataset with a CSV extension. The datastore displays a tabular datastore object for the data. Then, we select specific variables' names from the dataset; the selected variable names feature permits working with only the variables the user needs. The user can use the preview command to retrieve the data.
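Extending the driver sketch above, the read step of figure 4 maps onto MATLAB's datastore API as follows; the selected variable names are illustrative (they match the dataset fields described in Section 5):

    % Create a tabular datastore over the CSV dataset.
    ds = datastore('activities.csv');
    % Work only with the variables the user needs.
    ds.SelectedVariableNames = {'Acc1', 'Acc2', 'Acc3', 'act'};
    % Retrieve the first few rows for inspection.
    preview(ds)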


B. Generic Map Function

Figure 5 shows the generic map function, a general function for any key and value. It enables the coder to set any key-value pair for the selected dataset: we set an intermediate key and an intermediate value, subset the dataset at that specific value, and finally obtain a set of key-value pairs stored in the key-value store.

[Figure 5 flow: the data, a subset term, and a key-value store feed the generic map function → initialize the intermediate key and intermediate value → subset the data at the specific value → set the intermediate key-value store → add the set of intermediate keys and intermediate values to the intermediate key-value store.]

Figure 5. The block diagram of generic map function.

C. Map Function

[Figure 6 flow: the data and the intermediate key-value store feed the Map function, which receives the data and a specific value → set the condition of the key-value pair → create an output store for all partitions of the data that satisfy the key-value condition → store all the results in the output key-value store.]


Figure 6. The block diagram of map function.

Figure 6 illustrates the Map function, which gets a table with the variables labeled by the selected variable names property in the datastore. The Map function then extracts the subset of the dataset that satisfies the condition value of the selected key.

D. Reduce Function

Figure 7 shows the Reduce function, which receives the subsetted results gained from the Map function and simply merges them into a single table. The Reduce returns one key and one value.

[Figure 7 flow: the intermediate values and the key-value store feed the Reduce function → create the Reduce function → initialize the output value variable → get all intermediate results → while there is a next result, add the intermediate value to the output value → add all output values to the output key-value store.]

Figure 7. The block diagram of reduce function.
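As a hedged sketch of figures 5-7, the Map and Reduce functions could look as follows using MATLAB's mapreduce signatures; the key ('ironing') and the condition on the act variable are illustrative assumptions, not the paper's exact code:

    function mapFun(data, info, intermKVStore)
        % Subset the chunk at a specific key-value condition and emit it.
        key = 'ironing';                          % illustrative intermediate key
        subset = data(strcmp(data.act, key), :);  % rows satisfying the condition
        add(intermKVStore, key, subset);          % store the intermediate pair
    end

    function reduceFun(intermKey, intermValIter, outKVStore)
        % Merge all subsets gained from the Map function into a single table.
        out = table();
        while hasnext(intermValIter)              % "while has next results"
            out = [out; getnext(intermValIter)];  % add values to the output
        end
        add(outKVStore, intermKey, out);          % one key, one value
    end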


2.2 DBSCAN ALGORITHM

DBSCAN [5] is a clustering technique based on density. The idea is that if a particular point belongs to a cluster, it should be close to lots of other points in that cluster. The DBSCAN algorithm works as follows. First, two parameters are chosen: a positive number Epsilon and a natural number minPoints. We start by picking an arbitrary point in the dataset. If there are more than minPoints points within a distance of Epsilon from that point, we consider all of them to be part of a "cluster." We then expand that cluster by checking all of the new points and seeing whether they too have more than minPoints points within a distance of Epsilon; in this way, further points come to be added to the cluster. After that, we pick a new arbitrary point and repeat the process. It is entirely possible that the picked point has fewer than minPoints points within its Epsilon neighborhood and is not a part of any other group; it is then viewed as a "noise point" that does not fit in any group. The DBSCAN pseudocode is listed as follows:

1. Design a graph whose nodes are the points to be clustered.
2. For each core point c, make an edge from c to every point p in the neighborhood of c.
3. Set N to the nodes of the graph.
4. If N does not include any core points, terminate.
5. Pick a core point c in N.
6. Let X be the set of nodes that can be reached from c by going forward:
   a. create a cluster containing X ∪ {c};
   b. N = N \ (X ∪ {c}).
7. Continue with step 4.

2.3 FCM ALGORITHM

FCM [6,7] is a data clustering procedure in which the dataset is categorized into n clusters. Every data point in the dataset belongs to a cluster to some degree: a point close to the center of a cluster has a high level of association with that cluster, while a point that lies far from the center has a low level of association with it. This technique is often utilized in pattern recognition and depends on the minimization of an objective function. The algorithmic steps for Fuzzy C-means clustering are as follows. First, calculate the centers of the clusters using the following equation [6]:

c_j = (∑_{i=1}^{N} M_{ij}^m x_i) / (∑_{i=1}^{N} M_{ij}^m)    (1)

Then, the objective function is calculated based on the membership matrix:

J_m = ∑_{i=1}^{N} ∑_{j=1}^{C} M_{ij}^m ‖x_i − c_j‖²    (2)

Finally, the membership value is updated by:

M_{ij} = 1 / ∑_{k=1}^{C} (‖x_i − c_j‖ / ‖x_i − c_k‖)^{2/(m−1)}    (3)

where m is a real number greater than 1, M_{ij} is the degree of membership of x_i in cluster j, x_i is the ith d-dimensional measured data point, c_j is the d-dimensional center of the cluster, and ‖·‖ is the similarity measure between any measured data point and the center. FCM iteratively moves the cluster centers to the right locations within a dataset. FCM clustering strategies rely on fuzzy behavior, and they provide a natural method for producing a clustering in which membership weights have a natural interpretation but are not probabilistic at all.
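A brief sketch of FCM in practice, assuming MATLAB's fcm function from the Fuzzy Logic Toolbox, which iterates equations (1)-(3) until the objective function stops improving; the data and parameter values are placeholders:

    X = rand(100, 3);             % placeholder data: N points, d dimensions
    m = 2; nClusters = 4;         % fuzzifier m > 1, as required above
    options = [m, 100, 1e-5, 0];  % [exponent, max iters, min improvement, quiet]
    [centers, U] = fcm(X, nClusters, options);
    [~, labels] = max(U);         % hard-assign each point to its top cluster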


2.4 K-NEAREST NEIGHBORS (KNN)

In KNN regression, the outcome is an estimated value for the queried item: the average of the values of its k closest neighbors. Hence, testing the performance should be appropriate. KNN computes the Euclidean distance from the query example to each labeled example using the following equation [8], where x and y are the feature vectors of the query and labeled examples:

D = √(∑_{i=1}^{n} (x_i − y_i)²)    (4)

Selecting the ideal value for K is best done by first reviewing the data. A larger K value is generally more accurate, as it reduces the overall noise. The labeled examples are then ordered by distance, and a heuristic top-K number of adjacent neighbors is found. Finally, the data are searched for the most likely instance. KNN does not lose any detail and compares every training sample to give the prediction.

2.5 SINGULAR VALUE DECOMPOSITION

SVD receives a rectangular matrix of the data, defined as A, where A is an n × p matrix in which the n rows represent the data points and the p columns represent the experimental properties. The SVD theorem states that [9]:

A_{n×p} = U_{n×n} S_{n×p} V^T_{p×p}    (5)

where

U^T U = I_{n×n}    (6)
V^T V = I_{p×p}  (i.e., U and V are orthogonal)    (7)

where the columns of U are the left singular vectors, S has the same dimensions as A and contains the singular values, and the rows of V^T are the right singular vectors. The SVD represents an outline of the original data in a coordinate system where the matrix is diagonal. The SVD is calculated from the equations:

W = A A^T    (8)
W x = λx    (9)

The scalar λ is called an eigenvalue of W, and x is an eigenvector of W corresponding to λ. The computation of the SVD consists of finding the eigenvalues and eigenvectors of A A^T and A^T A: the eigenvectors of A^T A make up the columns of V, and the eigenvectors of A A^T make up the columns of U. The singular values in S are square roots of the eigenvalues of A A^T or A^T A; they are the diagonal entries of the S matrix, arranged in descending order, and are always real numbers. If the matrix A is a real matrix, then U and V are also real. The SVD yields the nearest rank-l approximation of a matrix: by setting the small singular values to zero, we can obtain matrix approximations whose rank equals the number of remaining singular values.
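A short sketch of this rank-l idea, assuming MATLAB's built-in svd; the matrix and the number of retained singular values are placeholders:

    A = randn(1000, 12);                    % placeholder n x p data matrix
    l = 3;                                  % singular values to keep
    [U, S, V] = svd(A, 'econ');             % A = U*S*V', equation (5)
    Al = U(:,1:l) * S(1:l,1:l) * V(:,1:l)'; % small singular values set to zero
    reduced = A * V(:,1:l);                 % l-dimensional view of the rows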


3. RELATED WORK

Big data from the IoT is an important research topic, and many researchers are working in this field. For example, Tao and Ji [10] utilized the MapReduce technique to investigate numerous small datasets. They proposed a procedure for massive small data based on the K-means clustering algorithm, and their outcomes established that the suggested manner can enhance data-processing efficiency. They used the K-means algorithm for data examination on top of MapReduce, then utilized a record for merging data within each cluster; the data in the same block have a high similarity when the merger is finished. This research will help them plan a merging technique for small data in the IoT. The DBSCAN algorithm is also suitable for application to big data because the number of clusters does not need to be known in advance.

Xu and Xun [11] outlined a MapReduce model of distributed computing. In the mechanism of MapReduce, they consolidated the architectural characteristics and key technology of the IoT and conducted distributed mining of data and information in the IoT world; they also address stream-data distribution. As a customary way of mining valuable information from raw data created by the IoT, they analyze the deficiencies of the conventional Apriori algorithm: Apriori has a lower mining efficiency and consumes considerable memory. The mining of massive data in the IoT involves stream-data investigation, clustering, and so on. They plan to propose a system for handling big data at a low cost while securing the data; the proposed system has low efficiency, so it needs to be improved.

Wang et al. [12] investigated an architecture of the IoT in agribusiness that provides distributed processing. The implementation is planned as a two-tier construction using HBase. The architecture gives real-time read access to the enormous sensor data, and backing sensor data are processed with the MapReduce model. XML documents set standards for the framework to bind the organization of heterogeneous sensor data. Using this framework leads the system to lack variety in sensor data.

Gole and Tidke [13] proposed the ClustBigFIM method, which is based on the MapReduce structure for mining large datasets. ClustBigFIM is an improvement of the BigFIM algorithm, offering speed in obtaining information from massive datasets. They rely on associations, sequential patterns, correlations, and other data mining tasks that give good vision. The MapReduce platform is utilized widely for mining big data from online social networks as a conventional set of tools and systems. ClustBigFIM aims to apply frequent itemset mining calculations with the MapReduce system to a flow of information, yielding consistent insights into big data.

Li et al. [1] suggested a storage management solution based on NoSQL, called IOTMDB, to handle the massive and heterogeneous IoT data. The IOTMDB is concerned not only with how to store the massive IoT data successfully but also with data sharing. The IoT data storage tactics incorporate a preprocessing procedure to cover common and precise requirements. Their future work will be a model oriented to IOTMDB relying on NoSQL; in addition, they will handle and investigate the massive IoT data


to expand its value. Applying a reduction algorithm in the preprocessing step would improve the accuracy and save time.

Mesiti and Valtolina [14] proposed a structure able to assemble information from distinct sources with diverse formats, such as JSON, XML, text, and data streaming from sensors. These data collections leave the database unstructured and require data integration. As the world moves toward developing big data investigation strategies, they arrived at an answer that loads information from heterogeneous sensors and then incorporates that heterogeneous sensor information using NoSQL frameworks. They outlined an easy-to-use loading framework by deciding on an arrangement to choose a fitting NoSQL framework that permits a reasonable mapping to be conveyed.

Zhang et al. [15] designed a massive data processing model in the cloud. Their model can be used to handle all types of data resources: structured, semi-structured, and unstructured. They concentrated on two main points: first, they outlined CloudFS, which depends on the open-source project Hadoop; second, they implemented Cloud Manager DB, which is constructed on the open-source projects HBase and MongoDB. However, they did not provide any method to deal with the variety of the data.

Galache et al. [16] presented the ClouT project, a joint European-Japanese venture. Their main issue is making cities aware of their resources and taking care of those resources through a set of smart IoT services in the cloud. The proposed framework is based on a three-layer architecture composed of CIaaS, CPaaS, and CSaaS layers. They developed four use cases associated with different applications within four cities; these assets are utilized and managed by effective IoT services in the cloud.

Sowe et al. [17] proposed an answer to the massive heterogeneous sensor data issue, which obliged them to join distinct types of information in an integrated IoT architecture. It incorporates Service-Controlled Networking (SCN) as a key middleware to oversee heterogeneous information accumulated from sensors on a big data cloud platform. The proposed model is applied to accumulate and share information and to control IoT societies; it allows the client to investigate, find, and use the sensor information. They utilized User Defined Harvester (UDH) technologies alongside SCN to extend the included detection. In their paper, portable sensing data are not accessible; they ought to implement a structure that can treat such sensing data.

Cecchinel et al. [18] proposed a software architecture ready to support big data analysis work, using datasets that originate from the physical sensors of the SMARTCAMPUS project. Their architecture can satisfy genuine requirements from the SMARTCAMPUS project. As a result, the work done in this architecture relies on data gathering and storage, i.e., the critical path of a big data collection platform using a middleware architecture. They plan to create


a programming model that empowers every client to create its own applications. If new information is added, the system software can go down; therefore, the system's adaptability ought to be improved.

Mishra et al. [19] proposed a Cognitive-Oriented IoT Big-data framework (COIB) for valuable data administration and knowledge discovery over IoT big data. They built a general IoT big data layered design through the usage of the COIB system in a huge-scale industrial automation environment. In their future work, they suggest incorporating the mining and examination of the huge information produced by trillions of IoT items.

In this paper, our proposed system offers a solution for storing and retrieving IoT big data and improves the accuracy of the resulting data. The proposed system can store and retrieve a massive amount of data in a small time. First, we clean the noise from the data. Then, we use Kennard sampling and SVD as data reduction techniques to reduce big data from the IoT without losing any data. Also, we use the mutual information algorithm to detect relationships between attributes and predict the semantic clusters. Finally, we use MapReduce based on FCM-DBSCAN to cluster the data for vast storage and retrieval.

4. THE PROPOSED SYSTEM

[Figure 8 pipeline: variety of sensors → raw data → storage → data cleaning → cleaned, noiseless data → data reduction → dimensionally reduced data → data integration → homogeneous data → data processing (clustering) → storage of data with little size.]

Figure 8. The proposed system of the massive-heterogeneous sensor data.
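Ahead of the detailed discussion below, the pipeline of figure 8 can be summarized in code form; every function name here is hypothetical glue standing in for a stage of the proposed system, not an actual implementation:

    raw = collectFromSensors();                  % raw data from a variety of sensors
    clean = cleanWithKNN(raw);                   % data cleaning: outliers, missing values
    small = reduceWithSVD(kennardSample(clean)); % data reduction: sampling + SVD
    merged = integrateWithMutualInfo(small);     % data integration: semantic clusters
    clusters = mapReduceFCMDBSCAN(merged);       % data processing: clustering, storage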


The proposed system consists of two main phases, data preprocessing and data processing, as shown in figure 8. In the preprocessing phase, the first stage is data collection, in which the dataset is collected from different sensors. The second stage is data cleaning based on outlier detection and noise removal, as it is easy to implement. The third stage is data reduction using the SVD algorithm to reduce the dimensionality of the data and the execution time of data processing, followed by Kennard sampling to select a random sample from the dataset, aiming to save running time. The last stage is data integration based on correlation and mutual information, aiming to determine the relationships between attributes and detect semantic clusters. In the processing phase, data are clustered using FCM-DBSCAN based on MapReduce, as it is a standard programming model for data distribution, to improve the performance of big data processing in a vast time. In the following subsections, the main stages of these two phases are discussed in detail.

a. Data Preprocessing Phase:

Preprocessing is a basic phase in data science because it qualifies the choices to be made from the qualified data. Data preprocessing is a data mining method that turns raw data into reasonable information. Genuine data are frequently incomplete, inconsistent, lacking in specific practices, and liable to include numerous mistakes; data preprocessing is a proven strategy for resolving such issues. It is used in database-driven applications, such as associations and standard established applications [20]. The applied data preprocessing steps are data cleaning, data reduction, and data integration, discussed in detail in the following subsections.

a) Data Cleaning:

The procedure of cleaning data is not easy: more than 30% of real information can be dirty, and cleaning is very costly [21]. Data can be cleaned using procedures such as filling in missing values, smoothing noisy data, or resolving inconsistencies in the data. Several ways have been used to deal with missing data, such as [22]:

- Deletion: removes the missing data and uses the rest of the data in the analysis. Deletion can be inefficient, as it decreases the dataset size and may delete valuable data.
- Imputation: tries to fill in the missing values with the help of many techniques, such as:
  o Mean/Mode: fills the missing data using the mean of a numeric attribute or the mode of a nominal attribute over all data.
  o K-Nearest Neighbor Imputation (KNN): uses the KNN algorithm to fill in the missing data. It can deal with discrete and continuous attributes. KNN searches all the data to find the most similar instances and can choose the most probable value from the dataset.

We suggest the KNN algorithm for data cleaning.
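A minimal sketch of KNN-based imputation, assuming MATLAB's knnimpute from the Bioinformatics Toolbox, which replaces NaN entries using nearest-neighbor information; the toy matrix is illustrative:

    X = [1.0 2.0 NaN;
         1.1 1.9 3.0;
         0.9 2.1 3.2];          % toy numeric data with one missing value
    Xfilled = knnimpute(X, 2);  % fill the NaN from the 2 nearest-neighbor columns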


[Figure 9 flow: input data → de-duplication → detect outliers → replace missing values → filtering.]

Figure 9. The block diagram of the data cleaning steps.

Figure 9 shows that the input data have many challenges, such as noisy data and outliers. First, the data are de-duplicated to remove repetition. Then, outliers are detected and excluded from the data. The data are filtered to choose the principal attributes that represent the data. Finally, missing values are replaced by the most probable value, depending on KNN regression.

b) Data Reduction

A massive amount of information is increasingly available from different sources, for example, logistics statistics, cameras, microphones, RFIDs, barcode data, remote sensor networks, and account logs from R&D [23]. High-dimensional information poses extraordinary challenges in terms of computational complexity and classification performance. Along these lines, it is important to obtain a low-dimensional feature space from the high-dimensional feature space in order to design a learner with good performance [24].

[Figure 10 flow: input data (cleaned and noiseless) → numericity reduction via sampling and dimensionality reduction via Singular Value Decomposition → reduced data.]

Figure 10. The block diagram for the data reduction steps.

Figure 10 shows that the cleaned data are the input for the data reduction stage. Data reduction is separated into numericity reduction and dimensionality reduction. Numericity reduction can be applied using regression or sampling; the sampling algorithm used here is the Kennard sample, which reduces the number of iterations by viewing a list of the highest smallest distances, aiming to save time. Dimensionality reduction can be applied using many algorithms, such as PCA, SOM, and SVD. We propose to use SVD for dimensionality reduction, as it is suitable for reducing the dimensionality of large-dimensional data. We compared the SVD algorithm with other algorithms (PCA, kernel PCA, ICA, and SOM) and conclude that the SVD algorithm operates in less time than the others.
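The Kennard sampling step can be sketched as the classic maximin selection; this is an assumption-level sketch (pdist and squareform are from the Statistics Toolbox), not the paper's exact routine:

    function idx = kennardSample(X, n)
        % Kennard-Stone selection: repeatedly pick the candidate whose
        % smallest distance to the already-chosen set is the largest.
        D = squareform(pdist(X));            % pairwise Euclidean distances
        [~, p] = max(D(:));                  % seed with the two most distant points
        [i, j] = ind2sub(size(D), p);
        idx = [i, j];
        while numel(idx) < n
            rest = setdiff(1:size(X, 1), idx);
            dmin = min(D(rest, idx), [], 2); % nearest chosen point per candidate
            [~, k] = max(dmin);              % the "highest smallest distance"
            idx(end + 1) = rest(k);          %#ok<AGROW>
        end
    end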


c) Data Integration

Data with diverse forms are put together and may conflict with each other. Data integration is a successful approach to combining data that live in various sources and giving unified access to the end users [25]. Big data presents organizations with huge volume and complexity: it comes in diverse structures (structured, semi-structured, and unstructured) and from any number of different sources, and from this immensity, various sorts of information are continuously developed. Therefore, organizations should concentrate on fast, exact, and significant insights [26]. The proposed algorithm is mutual information, which is able to deal with numeric data. Mutual information detects the relationships between the attributes and also detects the semantic clusters. The equation of the mutual information is as follows [27]:

MI = ∑_{x,y} P(X,Y) log₂ [P(X,Y) / (P(X) P(Y))]    (10)

where X and Y are the two dimensions of the dataset.
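A small sketch of equation (10) for two discretized attributes, using base MATLAB; the integer category codes are placeholders:

    x = randi(4, 1000, 1); y = randi(4, 1000, 1);  % placeholder attribute codes
    joint = accumarray([x y], 1);                  % co-occurrence counts
    Pxy = joint / sum(joint(:));                   % joint probabilities P(X,Y)
    Px = sum(Pxy, 2); Py = sum(Pxy, 1);            % marginals P(X), P(Y)
    terms = Pxy .* log2(Pxy ./ (Px * Py));         % summand of equation (10)
    MI = sum(terms(~isnan(terms)));                % skip empty cells (0*log 0)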

b. Data Processing Phase

The data processing phase is the control of information processing: information processing refers to the handling of the information needed to run organizations [28]. Massive data from the IoT require processing for data storage. Huge IoT data has a high sampling frequency, which results in a tremendous amount of repeated or extremely similar values. We suggest MapReduce based on a hybrid of FCM and DBSCAN as a clustering algorithm to overcome the massive data storage problem; MapReduce is considered the most suitable technique for massive data processing.

In the FCM-DBSCAN Map function, we first initialize the minimum points (the minimum number of points in each cluster), the epsilon value (the distance threshold between a center and a point), and the membership matrix. Then, we calculate the centers of the clusters using equation (1), and for each point in the dataset, the distance between the point and the center of each cluster is calculated using equation (2). If the distance between a point and the center of a cluster is within the epsilon value, the point is marked as a neighbor point of that cluster. Then, the neighbor points for each center are computed depending on the epsilon value. If the neighbor points of any cluster are fewer than the minimum points, the point is marked as noise; otherwise, the point is marked as clustered. We determine the key and create a new cluster. This repeats until the convergence state is reached. Finally, each point and the cluster it belongs to are emitted.

FCM-DBSCAN Map Function


FCM-DBSCAN(D, eps, MinPts, M)
  Initialize the number of clusters, the membership matrix M, eps, and MinPts.
  Calculate the centers of the clusters using equation (1).
  For each point P in dataset D
      If P is visited, continue to the next point.
      Mark P as visited.
      Calculate the distance between P and each center using equation (2).
      Compute NeighborPts of P based on eps.
      If |NeighborPts| < MinPts
          Mark P as NOISE.
      Else
          Determine the key and create a new cluster C; add P to C.
          For each point P′ in NeighborPts
              If P′ is not visited
                  Mark P′ as visited.
                  Compute NeighborPts′ of P′ based on eps.
                  If |NeighborPts′| >= MinPts
                      Set NeighborPts = NeighborPts ∪ NeighborPts′.
                  End if
              End if
              If P′ is not yet a member of any cluster
                  Add P′ to cluster C.
              End if
          End for
      End if
  End for
  Repeat until convergence; emit each point with the cluster it belongs to.
Output: Set of clusters of data.

In the FCM-DBSCAN Reduce function, the inputs are the minimum points, the epsilon value, the clusters, and the keys. For each cluster C, the final cluster points equal the previous cluster points plus the current cluster points. For every point in the cluster, if the point is


marked as unvisited, it is marked as visited. The neighbor points are calculated and compared with the minimum points: if the neighbor points are greater than or equal to the minimum points, the neighbor points are set to the union of the neighbor points and the cluster points. Finally, the output is a set of clusters of data.

As shown in figure 8, raw data are collected from different sensors, which raises many problems, such as noisy, heterogeneous, and massive data. Our proposed work aims to solve these problems that face sensor data. The raw data collected from the different sensors are stored, and preprocessing is then applied. The data are cleaned from noise by regression using KNN; we suggest KNN for dealing with noisy data as it is very simple and can detect the most probable value better than other techniques. The cleaned data are then reduced using the SVD algorithm, which is very suitable for reducing high-dimensional data and for validating significant vision of the data. The data are then sampled using the Kennard sample, which speeds up the running time. We integrate the data coming from heterogeneous sources based on correlation and covariance matrices, using the mutual information matrix to detect the relationships between elements in the dataset and predict semantic clusters. In the data processing step, the proposed model is the MapReduce model based on the FCM-DBSCAN clustering technique. It is a density-based clustering algorithm that gives an arrangement of entities in some space; it can discover clusters of diverse shapes and sizes from a huge quantity of data without specifying the number of clusters in advance.

5. THE EXPERIMENTAL RESULTS AND DISCUSSION

5.1 DATASET DESCRIPTION

The dataset includes ordinary IADL housekeeping activities [29]: vacuuming, ironing, dusting, brooming, mopping, cleaning windows, making the bed, watering plants, washing dishes, and setting the table. The overall duration of the dataset is 240 minutes; the intervals differ among the activities, reflecting the usual distribution of activities in daily life. The Porcupine sensor was used together with the iBracelet to record both acceleration and RFID tag detections. The dataset consists of approximately 1,048,576 records. We implemented the proposed technique on the dataset using Radoop, KNIME, and Matlab 2015b on a Core(TM) 2 Duo, 2 GHz processor with 3 GB RAM.

5.2 RESULTS VIEW


Figure 11. A part of the used dataset.

Figure 11 shows a part of the used dataset. The act field represents the activity label, which is one of ironing, vacuuming, brooming, making the bed, mopping, window cleaning, watering plants, dish washing, and setting the table. Acc1, Acc2, and Acc3 represent the 3D accelerometer [x, y, z]; Lgt represents light; Tlt represents nine tilt values; Btn represents the annotation buttons; Rtc represents the real-time clock [ddmmyyhhmmss]; and Time represents the elapsed number of seconds from the beginning of the recording.


Figure 12. The outlier detection.

Figure 12 shows the outlier detection. A new field called outlier appears: when an outlier is found, the value of this field is true; otherwise, it is false. The outlier is true when an observation is well outside the expected range of values in an experiment; an outlier arises from variability in the measurement or indicates experimental error. The outliers are excluded from the dataset.

Figure 13. The outlier excluding and replacing the missing values.

Figure 13 shows that the outlier property has the value false for all the tuples, and the missing values are replaced by the most probable value depending on KNN regression.

Figure 14. The SVD deployment.

Figure 14 shows the application of the SVD algorithm, which results in the reduction of the dataset: the data are represented using a smaller number of properties. The attribute with the highest singular value has the priority to be presented; SVD1 has the highest priority in representing the data.


Figure 15. The mutual information matrix.

Figure 15 shows the matrix produced by mutual information. The trans-information of two variables is a measure of the variables' mutual dependence: mutual information represents the rate of association or correlation between the row and column variables. The mutual information partitions the data by 2N, where N is the sample size. The mutual information between items is used as a feature for clustering to discover semantic clusters; a large mutual information value represents a strong relationship between attributes.

5.3 RESULT VIEW OF MAPREDUCE PROCESSING

Figure 16. The resulting attributes from the read-dataset code.


Figure 17. The MapReduce function execution and reading of the resulting data after the MapReduce implementation.

Figure 16 shows the data read from MapReduce, which displays the set of resulting attributes from the IADL dataset after the data preprocessing phase. Figure 17 shows the MapReduce implementation: the run begins with no Map and no Reduce progress (Map at 0% and Reduce at 0%) and proceeds until Map reaches 100% and Reduce reaches 100%. Then, we read the data resulting from the MapReduce implementation.

5.4 EVALUATION

The evaluation observes the time and accuracy of processing the dataset. As shown in Tables 2 and 3, the precision value is 99.3%, the sensitivity value is 99.53%, and the specificity value is 85.52%. From these results and evaluation, we conclude that the reduction step and FCM-DBSCAN enhanced the accuracy on the big data to 98.9%. The measures are defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (11)
Precision = TP / (TP + FP)    (12)
Sensitivity (TP rate) = TP / (TP + FN)    (13)
Specificity (TN rate) = TN / (TN + FP)    (14)

where
TP (True Positives): positive tuples correctly labeled;
FP (False Positives): negative tuples incorrectly labeled;
TN (True Negatives): negative tuples correctly labeled;
FN (False Negatives): positive tuples incorrectly labeled.


Table 1: The comparison between K-Means, Optics, EM, DBSCAN, and our proposed FCM-DBSCAN based on the MapReduce model (per cell: Accuracy (%) / Time (Sec.)).

Technique     PCA           PCA (Kernel)  ICA           SOM           SVD
k-Means       92 / 0.2      0.89 / 2      87 / 0.2      92.15 / 0.1   94.73 / 0.2
Optics        90.2 / 1.9    91 / 9.68     65.48 / 0.6   91.05 / 1.9   90 / 0.79
EM            65.82 / 13    75.28 / 2.17  94.4 / 2.54   66.64 / 8     95.21 / 2
DBSCAN        93.4 / 0.3    89.3 / 7.3    90.12 / 0.4   88.46 / 1     98 / 3.11
FCM-DBSCAN    94.5 / 0.25   91.6 / 5.2    97.48 / 0.5   93.5 / 2.3    98.9 / 1.5

Table 1 shows the comparison between different clustering algorithms (K-Means, Optics, EM, and DBSCAN) and the proposed FCM-DBSCAN approach. The clustering algorithms are tested with different data reduction algorithms: PCA, PCA (Kernel), ICA, SOM, and SVD. For Tables 2 and 3, we divided the dataset into training data and testing data, then evaluated the proposed approach on the test data.

Table 2: The positive and negative (confusion) matrix for the proposed system.

              Predicted: True    Predicted: False
Actual: Yes   9500               44
Actual: No    66                 390

Table 3: The performance measures of our proposed system.

Recall        99.53%
Precision     99.3%
Sensitivity   99.53%
Specificity   85.52%
Accuracy      98.9%
F-measure     99.39%


[Figure 18: bar chart of the time in seconds for each clustering technique (K-Means, Optics, EM, DBSCAN, FCM-DBSCAN) under each reduction algorithm (PCA, PCA kernel, ICA, SOM, SVD).]

Figure 18. The expended time comparison between the different clustering techniques under the different reduction algorithms, based on the MapReduce model.

From the comparative studies in Table 1 and figure 18, we find that FCM-DBSCAN has high accuracy with all of the data reduction approaches, and FCM-DBSCAN with SVD has the highest accuracy while retrieving data in a small time. K-Means and Optics have similar accuracy values, but Optics takes longer. The EM algorithm takes more time than the other techniques. DBSCAN has high accuracy but also takes longer. With FCM-DBSCAN, the accuracy increased and the expected time decreased.

6. CONCLUSION

A massive amount of IoT data has been generated due to the vast increase in existing devices, sensors, actuators, and network communications. The resulting massive IoT data is called "big data": data so large that it takes much time to be processed. Therefore, we focused on a clustering methodology relying on the MapReduce model to store data and recover results in near real-time, and we offer a framework for processing massive and heterogeneous data in the IoT. This paper examined big data from the IoT from many viewpoints. The raw dataset is collected from different sensors, which leads to many problems, such as noisy, heterogeneous, and massive data. Our proposed system aims to solve these problems that face sensor data. The architecture of the proposed system consists of two main


phases: data preprocessing and data processing. In the preprocessing phase, we used KNN to clean noisy data and replace missing data with the most probable value. SVD is used to reduce the data and save time, and mutual information is implemented to detect the relationships in the data and find semantic clusters, achieving high accuracy and a faster running time. The MapReduce model based on FCM-DBSCAN achieves data clustering with Map and Reduce functions in a small time, which results from using the reduction technique before clustering. The processing time of the proposed system is 1.5 seconds, and the accuracy is 98.9%.

In future work, we will run the processing on different datasets and apply different techniques using the Spark model, aiming to speed up the running time. Moreover, we will implement data query processing using the best and most suitable NoSQL database model; we suggest a key-value database. Key-value (KV) stores use the associative array, also called a map, and this approach can efficiently retrieve selective key ranges. We will also address the challenges of big data processing in cloud computing environments and develop it further.

7. REFERENCES

[1] Li, T., Liu, Y., Tian, Y., Shen, S., & Mao, W. (2012). A storage solution for massive IoT data based on NoSQL. IEEE International Conference on Green Computing and Communications (GreenCom), Besancon, 50-57.

[2] Tsai, C., Lai, C., Chiang, M., & Yang, L. (2014). Data mining for Internet of Things: a survey. IEEE Communications Surveys & Tutorials, 16(1), 77-97.

[3] Sharma, S. & Mangat, V. (2015). Technology and trends to handle big data: a survey. 5th IEEE International Conference on Advanced Computing & Communication Technologies (ACCT), Haryana, 266-271.

[4] Martha, V. S., Zhao, W., & Xu, X. (2013). h-MapReduce: a framework for workload balancing in MapReduce. 27th IEEE International Conference on Advanced Information Networking and Applications, 637-644.

[5] Dharni, C. & Bnasal, M. (2013). An improvement of DBSCAN algorithm to analyze cluster for large datasets. IEEE International Conference on MOOC, Innovation and Technology in Education (MITE), 42-46.

[6] Ghosh, S. & Kumar, S. (2013). Comparative analysis of K-Means and Fuzzy C-Means algorithms. International Journal of Advanced Computer Science and Applications, 4(4), 35-39.

[7] Bora, D. & Gupta, D. (2014). A comparative study between fuzzy clustering algorithm and hard clustering algorithm. International Journal of Computer Trends and Technology, 10(2), 108-113.

[8] Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. Third Edition, Elsevier, Chapter 9, 422-425.


[9] Singular Value Decomposition (SVD) tutorial. (2016). Web.mit.edu. Retrieved 7 Jan 2016, from http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

[10] Tao, X. & Ji, C. (2014). Clustering massive small data for IOT. 2nd International Conference on Systems and Informatics (ICSAI), Shanghai, 974-978.

[11] Liancheng, X. & Jiao, X. (2014). Research on distributed data stream mining in Internet of Things. International Conference on Logistics Engineering, Management and Computer Science (LEMCS), Atlantis Press, 149-154.

[12] Wang, H., Lin, G., Wang, J., Gao, W., Chen, Y., & Duan, Q. (2014). Management of big data in the Internet of Things in agriculture based on cloud computing. AMM, 548-549, 1438-1444.

[13] Gole, S. & Tidke, B. (2015). Frequent itemset mining for big data in social media using ClustBigFIM algorithm. IEEE International Conference on Pervasive Computing (ICPC), Pune, 1-6.

[14] Mesiti, M. & Valtolina, S. (2014). Towards a user-friendly loading system for the analysis of big data in the Internet of Things. 38th IEEE Annual International Computers, Software, and Applications Conference Workshops (COMPSACW), Vasteras, 312-317.

[15] Zhang, G., Li, C., Zhang, Y., Xing, C., & Yang, J. (2012). An efficient massive data processing model in the Cloud - a preliminary report. 7th ChinaGrid Annual Conference, Beijing, 148-155.

[16] Galache, J., Yonezawa, T., Gurgen, L., Pavia, D., Grella, M., & Maeomichi, H. (2014). ClouT: leveraging cloud computing techniques for improving management of massive IoT data. 7th IEEE International Conference on Service-Oriented Computing and Applications (SOCA), Matsue, 24-327.

[17] Sowe, S., Kimata, T., Dong, M., & Zettsu, K. (2014). Managing heterogeneous sensor data on a big data platform: IoT services for data-intensive science. 38th IEEE Annual International Computers, Software, and Applications Conference Workshops, Vasteras, 259-300.

[18] Cecchinel, C., Jimenez, M., Mosser, S., & Riveill, M. (2014). An architecture to support the collection of big data in the Internet of Things. 10th IEEE World Congress on Services, Anchorage, AK, 442-449.

[19] Mishra, N., Lin, C., & Chang, H. (2014). A cognitive-oriented framework for IoT big-data management prospective. IEEE International Conference on Communication Problem-Solving (ICCP), Beijing, 124-127.

[20] What is Data Preprocessing? - Definition from Techopedia. (2015). Techopedia.com. Retrieved 9 July 2015, from http://www.techopedia.com/definition/14650/data-preprocessing


[21] Tang, N. (2015). Big RDF data cleaning. 31st IEEE International Conference on Data Engineering Workshops (ICDEW), Seoul, 77-79.

[22] Shoaip, N., Elmogy, M., Riad, A., & Badria, F. (2015). Missing data treatment using interval-valued fuzzy rough sets with SVM. International Journal of Advancements in Computing Technology (IJACT), 7(5), 37-48.

[23] Sadeghzadeh, K. & Fard, N. (2015). Nonparametric data reduction approach for large-scale survival data analysis. IEEE Reliability and Maintainability Symposium (RAMS), Palm Harbor, 1-6.

[24] Katole, S. & Karmore, S. (2015). A new approach of microarray data dimension reduction for medical applications. 2nd IEEE International Conference on Electronics and Communication Systems (ICECS), Coimbatore, 409-413.

[25] Saranya, K., Hema, M., & Chandramathi, S. (2014). Data fusion in ontology based data integration. IEEE International Conference on Information Communication and Embedded Systems (ICICES), Chennai, Tamil Nadu, India, 1-6.

[26] Pal, K. (2015). How to address common big data pain points. Data Informed. Retrieved 8 July 2015, from http://data-informed.com/how-to-address-common-big-data-pain-points

[27] Cover, T. & Thomas, J. (2012). Elements of Information Theory. Second Edition, John Wiley & Sons, Chapter 2, 19-22.

[28] Encyclopedia Britannica: data processing | computer science. (2015). Encyclopedia Britannica. Retrieved 7 July 2015, from http://www.britannica.com/technology/data-processing

[29] ADL Recognition Based on the Combination of RFID and Accelerometer Sensing | Embedded Sensing Systems - www.ess.tu-darmstadt.de. (2015). ess.tu-darmstadt.de. Retrieved 17 August 2015, from http://www.ess.tu-darmstadt.de/datasets/PHealth08-ADL
