Big Data Management: Project and Open Issues

Hassan Keshavarz, Ph.D. Candidate, Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia (UTM)
Wan Haslina Hassan, Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia (UTM)
Shozo Komaki, Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia (UTM)
Naoki Ohshima, Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia (UTM)
Abstract
In the future, projects will be inundated with huge amounts of data related to a variety of fields. Existing data are growing exponentially across domains worldwide in a way that is difficult to control. This phenomenon is a sign that the era of Big Data has arrived. Data management and data analysis are plausibly the ultimate way to extract invaluable information from these massive data sets. However, producing highly accurate outcomes from such enormous data requires considerable effort, expense, and time. Hence, the Big Data phenomenon deserves closer attention. Nowadays, researchers investigate the impact of Big Data in a variety of domains such as management, business, and communication, and try to identify the opportunities and challenges of Big Data. In this study, we define the Big Data subject, explain its major challenges, and discuss some data mining algorithms for data clustering in management science. Moreover, we outline open issues that help identify new research directions in Big Data management. In conclusion, this study attempts to provide a solution for Big Data management by means of existing data mining algorithms.
Key Words & Phrases — Big Data, Big Data Management, Big Data Analytics
1. Introduction
We live in the data age. It is not easy to measure the total volume of data stored electronically, but according to IDC's 2011 Digital Universe Study commissioned by EMC, it surpassed 1.8 zettabytes in 2011 [1,2]. A zettabyte is 10^21 bytes, which is roughly the same order of magnitude as one disk drive for every person in the world. This flood of data comes from different sources such as stock exchanges, social networks, machine logs, RFID readers, sensor networks, vehicle GPS traces, and astronomy [3]. Therefore, with the rapid growth of data and of the business environment, data analysis and mining seem to be the ultimate way to extract invaluable information from these massive data sets. That is, obtaining analysis results quickly is a key requirement for 'Big Data Analytics' [4]. A recent report [5] found that most Big Data analytics options will experience some level of growth in the near future. Four groups of options stand out based on combinations of growth and commitment: (1) strong-to-moderate commitment, strong potential growth; (2) moderate commitment, good potential growth; (3) weak commitment, good growth; and (4) strong commitment, flat or declining growth. Table 1 summarizes the Big Data analytics trends for these four groups.

Table 1. Trends of Big Data Analytics [5]
1- Strong-to-Moderate Commitment, Strong Potential Growth: advanced analytics; advanced data visualization; predictive analytics; real-time dashboards; text mining; in-memory databases; visual discovery.
2- Moderate Commitment, Good Potential Growth: analytics in the enterprise data warehouse (EDW); analytics in databases; in-database analytics; data warehouse appliances; DBMSs built for data warehousing; private clouds; mixed workloads.
3- Weak Commitment, Good Growth: extreme SQL; SaaS; Hadoop; MapReduce; inline analytics; NoSQL DBMSs; public clouds.
4- Strong Commitment, Flat or Declining Growth: data mining; data marts for analytics; central EDWs; statistical analysis; OLAP tools; hand-coded SQL; DBMSs built for OLAP.
Considering the above trends, the demand for large-scale data mining computational and storage resources is growing dramatically, which has driven many researchers [6,7] to develop new parallel and distributed data mining systems targeted at individual models and applications. However, methods and tools for computing accurate results currently do not work efficiently on Big Data [8]. A common way of analyzing these large datasets is to perform sampling and to compute approximate, but fast and reasonably accurate, results from such samples. Unfortunately, existing approaches for early results are currently not supported in MapReduce-oriented systems [9]. The rest of the paper is organized as follows. In Section 2, we categorize Big Data into several types and review the existing data processing technologies for each type. In Section 3, we describe the characteristics of Big Data. In Section 4, we summarize some algorithms for clustering in the data mining field. In Section 5, we enumerate some research opportunities and challenges for Big Data project management. Finally, Section 6 concludes the paper.
2. What Is Big Data?
The size of the stored data is basically what defines how big the data is. However, other factors such as the velocity and variety of the data are also important in characterizing Big Data. More precisely, the three Vs (volume, variety, and velocity) provide a comprehensive definition of Big Data, one that prevents it from being defined merely by the volume of the data. Moreover, each of these three Vs has its own specific implications for analytics. There is a widely held belief among scientists that "more data beats better algorithms". Big Data offers the opportunity to detect and analyze insights rooted in datasets. However, there are challenges within this field, such as computational cost and storage. Volume, variety, and velocity in this field call for better analytics [10].
Big Data volume: Nowadays, the main challenge posed by data volume is prominent in the age of the Internet of Things (IoT). Potentially valuable data sources in science include biology, meteorology, astronomy, and so forth. In the realm of the Internet, sources include Google, social networks (Facebook, Twitter, LinkedIn, and MySpace), sensor networks, and the like; other big-volume sources can be found in realms such as finance, business, economics, industry, manufacturing, and personal life. Because of these varied large-scale issues, there is no universal agreement on the definition of Big Data, which depends on several important parameters. Since much of this data is repetitive and sometimes of no interest, proposing a model that filters and compresses data by orders of magnitude is an important challenge from a Big Data management perspective. Due to the lack of an appropriate model in many organizations, there is no suitable way to tame the Big Data volume in order to make better decisions. The problem of volume can be addressed by incremental data mining algorithms, sampling, processing data in batches, and the development of parallel algorithms, as sketched below.
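As a concrete illustration of the sampling and batch-processing approach mentioned above, the following minimal Python sketch estimates an aggregate from a random sample and then refines it incrementally as batches arrive. The data and parameters are made up for illustration; this is not tied to any particular system described in the paper.

import random

def sample_estimate(records, fraction=0.01):
    """Estimate the mean of a large collection from a small random sample."""
    sample = [r for r in records if random.random() < fraction]
    return sum(sample) / len(sample) if sample else 0.0

class IncrementalMean:
    """Process data in batches, updating a running aggregate instead of
    recomputing over the full (and ever-growing) data set."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        self.count += len(batch)
        self.total += sum(batch)
        return self.total / self.count

# Synthetic example: one million readings processed as ten batches.
readings = [random.gauss(50.0, 5.0) for _ in range(1_000_000)]
print("sample-based estimate:", sample_estimate(readings))
agg = IncrementalMean()
for i in range(0, len(readings), 100_000):
    estimate = agg.update(readings[i:i + 100_000])
print("incremental (batch) mean:", estimate)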
Big Data Variety: The second important factor is the complexity of the data structure. It is an undeniable fact that internet-produced data in the global village is set to explode. So another challenge is how we can capture, store, manage, and process the data in the right time frame in order to use it as an asset. Nowadays, everyone is able to produce different classifications and taxonomies of Big Data. Since one of the main properties of Big Data is heterogeneity [11], most of the data falls into two categories, unstructured and semi-structured.
Big Data Velocity: The velocity challenge concerns the need to manage the rate at which new data is generated or existing records are updated. This issue mostly arises with machine-generated data, for example data produced by sensing or mobile devices positioned in different places. In such applications, huge amounts of new and updated data enter the systems, and the systems need to process the data immediately after it is created. Nowadays, the speed of data raises many challenges for data management and administration. Both the storage and the query processing layers should be fast and scalable. To manage high-speed data, data stream processing has been studied for decades. However, present streaming systems have limited capacity, particularly when faced with the growing volume of new data generated by today's sensor networks, telecommunication systems, and so on.
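To make the velocity issue concrete, the short Python sketch below maintains a per-key event count over a sliding time window, discarding events that fall outside the window as new data arrives. It is a simplified illustration with a hypothetical sensor stream, not a description of any specific streaming system.

from collections import deque, defaultdict

class SlidingWindowCounter:
    """Count events per key over the last `window_seconds` of stream time."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()               # (timestamp, key) in arrival order
        self.counts = defaultdict(int)

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events older than the window so memory stays bounded.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

# Hypothetical stream of (timestamp in seconds, sensor id) pairs.
counter = SlidingWindowCounter(window_seconds=60)
for t, sensor in [(0, "s1"), (10, "s2"), (30, "s1"), (75, "s1")]:
    counter.add(t, sensor)
print(dict(counter.counts))   # counts reflect only the last 60 seconds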
3. Big Data Characteristics: HACE Theorem
Big Data is acknowledged to start with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and to seek to explore complex and evolving relationships among data. These characteristics pose great challenges for anyone trying to extract useful knowledge from Big Data. An age-old analogy is that of a number of blind men attempting to size up a giant elephant; the large animal is the Big Data in this context. Each of the men intends to draw a picture of (or make a conclusion about) the elephant according to the part of the information he has gathered during the process. Because each person's view is constrained by his local region, it is not surprising that each blind man will, in his own way, conclude that the elephant "feels" like a rope, a hose, or a wall, depending on the region where he is located. To make the problem more complex, we can assume that (a) the elephant grows rather fast and its pose also changes at a constant rate, and (b) the blind men learn from each other as they exchange their self-derived information about the elephant. Exploring Big Data in this scenario is equivalent to aggregating heterogeneous information from various sources (the blind men) to help draw the best possible picture revealing the genuine posture of the elephant in real time. Indeed, this task is not as simple as asking each blind man to describe his impressions of the elephant and then having an expert draw one single merged picture, given that each individual may speak a different language (heterogeneous and diverse information sources) and that there are also privacy concerns about the messages they share during the information exchange process.
4. Cluster Analysis Algorithms
The data mining algorithms in this section are among the most significant in the field of data mining. This study seeks to bring to light some well-established algorithms and to elaborate on the effect these algorithms have on Big Data management. The algorithms discussed fall into statistical learning, classification, association analysis, clustering, and link mining, which are prominent because they are the most important subjects of development and research in data mining. The implementation and execution of large projects carry high risks due to unforeseen elements. These risks require anticipation and the necessary mitigation steps, supported by management tools, in line with project management practice. As a project is faced with a high volume of data, the aim in this section is to use data mining algorithms to manage the available data and then interpret the collected data using existing algorithms such as k-means, kNN, the EM algorithm, the PageRank algorithm, the AdaBoost algorithm, Naive Bayes, CART, support vector machines, and the Apriori algorithm, in order to reduce cost, time, and risk and to increase project quality based on the results obtained. Time, cost, and quality are among the principal aims of every project. In recent years, the demands of project stakeholders for reducing costs along with reducing total project time and enhancing project quality have intensified. This encourages researchers to use data mining algorithms and to extract useful information for project management.
The k-means algorithm
The k-means algorithm is a simple iterative method used to partition a given dataset into a user-specified number of clusters, k. Several scholars have drawn attention to this algorithm, whose universality extends across various disciplines, notably Lloyd (1957, 1982) [12], Friedman and Rubin (1967) [13], and MacQueen (1967) [14]. A detailed history of k-means, along with descriptions of its many variations, is given in [15]. Gray and Neuhoff [16] provide a thorough historical background for k-means within the broader family of hill-climbing algorithms. The algorithm operates on a set of d-dimensional vectors, D = {x_i | i = 1, ..., N}, where x_i in R^d denotes the i-th data point. The algorithm starts by choosing k points in R^d as the initial k cluster representatives or "centroids". Techniques for choosing these initial seeds include sampling at random from the dataset, setting them as the solution of clustering a small subset of the data, or perturbing the global mean of the data k times. The algorithm then iterates between two steps until it converges.
Step 1: Data assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. The outcome of this step is a partitioning of the data.
Step 2: Relocation of "means". Each cluster representative is relocated to the center (mean) of all data points assigned to it. If the data points carry a probability measure (weights), then the relocation is to the expectations (weighted means) of the data partitions.
The algorithm converges when the assignments (and hence the c_j values) no longer change. Each iteration requires N x k comparisons, which determines the time complexity of one iteration; the number of iterations required for convergence varies, but as a first cut the algorithm can be considered linear in the dataset size. One issue is how "nearest" is measured in the assignment step. The default measure of closeness is the Euclidean distance, in which case one can readily show that the non-negative cost function (the sum of squared distances from each point to its assigned centroid) decreases whenever there is a change in either of the two steps, and hence convergence is guaranteed in a finite number of iterations.
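The following minimal Python sketch implements the two-step iteration described above (assignment by Euclidean distance, then relocation of the means). It is an illustrative implementation on toy data with random seeding, not production code.

import random

def kmeans(points, k, max_iters=100):
    """Basic k-means: points is a list of d-dimensional tuples."""
    centroids = random.sample(points, k)          # initial seeds drawn from the data
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid (ties broken by index order).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Step 2: relocate each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:            # assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters

data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3), (9.0, 0.5), (9.2, 0.4)]
centroids, clusters = kmeans(data, k=3)
print(centroids)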
The Apriori algorithm
One of the most popular techniques in data mining is to detect frequent item sets from a transaction dataset and derive association rules. Detecting frequent item sets (item sets with frequency greater than or equal to a user-specified minimum support) is not trivial because of the combinatorial explosion involved. Once frequent item sets are obtained, it is straightforward to generate association rules with confidence greater than or equal to a user-specified minimum confidence. Apriori is an algorithm for detecting frequent item sets using candidate generation [17]. It is a level-wise complete search algorithm that exploits the anti-monotonicity of item sets: "if an item set is not frequent, none of its supersets can be frequent". Apriori assumes that items in a transaction or item set are sorted in lexicographic order. Let F_k denote the set of frequent item sets of size k and C_k their candidates. It first scans the database to find frequent item sets of size 1 by counting each item and keeping those that satisfy the minimum support requirement. It then iterates over three steps (candidate generation, support counting, and pruning by minimum support) to extract all frequent item sets. By reducing the size of the candidate sets, the algorithm achieves good performance. However, when frequent item sets are abundant, item sets are large, or the minimum support is very low, it still suffers from the cost of iteratively generating a large number of candidate sets and of examining them against the database. For example, about 2^100 candidate item sets must be generated to obtain a frequent item set of size 100.
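As an illustration of the level-wise search just described, here is a compact Python sketch of Apriori on a toy transaction list. It follows the generate-count-prune loop under the anti-monotonicity property and is meant only as a sketch, not an optimized implementation.

from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent item sets (as frozensets) with support >= min_support."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # F1: frequent item sets of size 1.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join F(k-1) with itself, keeping sets of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset (anti-monotonicity).
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Support counting over the transaction database.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent

db = [["bread", "milk"], ["bread", "diapers", "beer"], ["milk", "diapers", "beer"],
      ["bread", "milk", "diapers"], ["bread", "milk", "beer"]]
print(apriori(db, min_support=3))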
kNN: k-Nearest Neighbor Classification
The rote classifier is one of the simplest, even somewhat trivial, classifiers: it memorizes the entire training data and classifies a test object only if its attributes match one of the training examples exactly. An obvious drawback of this approach is that many test records go unclassified because they do not exactly match any training record. A more sophisticated method, k-Nearest Neighbor (kNN) classification [18], finds the group of k objects in the training set that are closest to the test object and bases the assignment of a label on the predominance of a particular class in this neighborhood. The three main elements of this method are: a set of labeled objects, e.g. a set of stored records; a similarity or distance metric to compute the distance between objects; and the value of k, the number of nearest neighbors. To classify an unlabeled object, the distance from this object to each of the labeled objects is computed, its k nearest neighbors are identified, and the class labels of these nearest neighbors are then used to determine the class label of the object.
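The following short Python sketch illustrates the kNN procedure on made-up data: compute the distance from the test object to every labeled object, take the k closest, and vote on their class labels. Euclidean distance and unweighted majority voting are assumptions made here for simplicity; other metrics and weighting schemes are common.

from collections import Counter
import math

def knn_classify(train, test_point, k=3):
    """train: list of (feature_vector, label); returns the majority label
    among the k training objects nearest to test_point."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda item: distance(item[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled data: two features per object.
train = [((1.0, 1.0), "A"), ((1.5, 1.8), "A"), ((5.0, 5.2), "B"),
         ((5.5, 4.9), "B"), ((1.2, 0.7), "A")]
print(knn_classify(train, (1.1, 1.2), k=3))   # expected: "A"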
5. Big Data Challenges and Opportunities
Big Data Integration
Given the heterogeneity of the data flood, simply recording data and throwing it into a repository may not be sufficient. If data is merely placed in a repository, it is not easy for others to find it, let alone reuse it. Even with adequate metadata, challenges remain because of differences in experimental details and in record structure. Data analysis remains a more challenging task than the other processes involving data. On a large scale, these processes have to be carried out automatically, so differences in data structure and semantics need to be expressed in machine-readable form and then resolved automatically. There is solid scholarly work in the area of data integration that can offer solutions, but additional work is needed to achieve a resolution of differences that is automated and, more importantly, error-free.
Data Mining and Analysis
Mining requires integrated, consolidated, dependable, cleaned, and accessible data, declarative queries, scalable mining algorithms, mining interfaces, and large-scale computing environments. At the same time, data mining can help improve the quality and integrity of the data, help grasp data semantics, and provide intelligent querying functions. As mentioned before, real-life medical records contain errors, are heterogeneous, and are often spread across multiple systems. The value of Big Data analysis can only be realized if it can be applied reliably under such challenging conditions. Conversely, knowledge developed from data can assist in error correction and in removing ambiguity. For example, the term "DVT" usually refers to either "deep vein thrombosis" or "diverticulitis", which are two different medical conditions. A knowledge base built from related data can use the relevant symptoms or medications to determine which condition the physician means. A current problem in Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, and analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analysis. Today's analysts are hindered by the tedious process of exporting data from the database, performing a non-SQL process, and bringing the data back. This is a barrier to carrying over the interactivity of the first generation of SQL-based OLAP applications into data mining analysis, which is in growing demand. A tight coupling between declarative query languages and the functions of such packages would benefit both the expressiveness and the performance of the analysis.
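To illustrate the "DVT" example above, the following toy Python sketch scores each candidate expansion of an abbreviation by how many of its associated terms co-occur in the surrounding record. The knowledge base here is hypothetical and hand-built for illustration; it is not a real medical system or vocabulary.

# Hypothetical mini knowledge base: candidate expansions of an abbreviation,
# each with terms that typically co-occur with it in clinical records.
KNOWLEDGE_BASE = {
    "DVT": {
        "deep vein thrombosis": {"leg swelling", "anticoagulant", "warfarin", "ultrasound"},
        "diverticulitis": {"abdominal pain", "fever", "antibiotics", "ct scan"},
    }
}

def disambiguate(abbrev, record_text):
    """Pick the expansion whose associated terms appear most often in the record."""
    candidates = KNOWLEDGE_BASE.get(abbrev, {})
    text = record_text.lower()
    scores = {expansion: sum(term in text for term in terms)
              for expansion, terms in candidates.items()}
    return max(scores, key=scores.get) if scores else None

note = "Patient with DVT, started on warfarin after ultrasound showed leg swelling."
print(disambiguate("DVT", note))   # expected: "deep vein thrombosis"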
Big Data for Risk Management
Risk management in industry can profit from the application of Big Data. If governments and companies had stronger funding strategies, and bank loan officers were more invested in data-driven decisions, we would probably have a more advanced world today. For instance, if health care administrators had better awareness of, and proficiency in, evaluating the millions of health care transactions that occur daily, they could better recognize underlying disease risk factors and fraudulent claims, helping to reduce morbidity and mortality and to prevent billions in health care costs and losses annually [16]. Most companies' emphasis is on rigorously managing financial information to ensure the precision and reliability of their financial data and to protect their Chief Financial Officer (CFO) and Chief Executive Officer (CEO). Although growing scrutiny driven by such regulators and compliance mandates as Sarbanes-Oxley, Basel II, Solvency II, and the Health Insurance Portability and Accountability Act (HIPAA) has placed increasing responsibilities on the Chief Information Officer (CIO) organization for data quality and governance, ownership of the information, and therefore responsibility for its reliability, must remain with the business.
Project Cost Management
As far as hardware prices are concerned, it is true that hardware storage is reasonably priced and its cost continues to decline, but in reality many other data center operating expenses, which make up the Total Cost of Ownership (TCO), are sadly ignored. Other costs (for instance energy and maintenance) are also shown to increase in direct proportion to the seemingly unstoppable growth of data. Across industries, lowering costs by improving operational efficiency is of interest to business managers. As stated above, in any business, Big Data plays an effective role through constant monitoring and investigation of data produced by sophisticated machines, embedded environmental sensors, and operational metrics. This Machine-to-Machine (M2M) sensor data, which was traditionally discarded because of its unmanageable volume or the rate at which it is produced, has the potential to increase the operating productivity of equipment as well as general operational safety. Analyzing this type of M2M data is valuable for assessing the condition and quality of equipment components and for timely preventive maintenance, thus avoiding expensive industrial downtime. Consequently, properly managing such Big Data contributes to the productivity and efficiency of processes.
Project Communication Management: Another element in the project management context is project communication management. It provides the link between people and the important information required for effective and successful communication. Information collected through the relationships between the manager and team members, sponsors, customers, and stakeholders can affect the whole course of a project. There are several means of communication between the manager and these stakeholders, for example email, fax, voice, video, written reports, oral presentations, informal memos, and formal reports. Generally, the vital information needed for project communication requirements is as follows: organization charts, stakeholder information, external and internal information, logistics, disciplines, departments and specialties, project organization, and stakeholder responsibility relationships. This is where we are faced with large-scale data collected through the communication process. Accurate analysis of the collected data can underpin the success of a project, whereas a flawed analysis process may end in failure. Moreover, if the data collected via communication were analyzed properly, it could improve the performance of future projects. In sum, good analysis of project communication data can provide great value for managers and corporations. However, due to the lack of efficient systems, achieving accurate results from Big Data analysis remains a major challenge.
Human Resource Management
Human resource management includes human resource planning, acquiring the project team, developing the project team, and managing the project team, each of which has its own subsections. The project management plan includes the activity resource requirements and other parameters, such as procurement, quality management, and risk management, that help define human roles and responsibilities. Since human resources and responsibilities are determined through organization charts and position descriptions in different formats (hierarchical, matrix, and text-oriented), and a project may also be divided into several subsidiary projects, we are faced with a data volume for which a Big Data analysis solution must be proposed. With accurate analysis of the human resource management data and a trustworthy report for the manager, he or she can apply management policies such as staff training and hiring in the project human resource management section. For these reasons, the data generated by project staff assignments, roles and responsibilities, project organization charts, work performance information, and project reports needs accurate analysis in order to predict project trends and to support decision making on human resource evaluation and hiring.
6. Conclusion
In this paper, we presented a survey of Big Data from both technical and management perspectives and reviewed several algorithms. We described the deficiencies of Big Data as a motivation for improving project management. In addition, we surveyed three well-known and widely credited data mining algorithms. Data is growing at an exponential rate nowadays, yet the corresponding information technology lags behind comparatively. Hence, much work remains to be done on project management so that we can face the challenges brought by Big Data. Finally, in the context of Big Data management, several open challenges, in particular Big Data project communications management, predictive analytics on Big Data, and Big Data project human resource management, are the most prominent and demand future efforts.
References
[1] Gantz J, & Chute C. (2011): The diverse and exploding digital universe: An updated forecast of worldwide information growth through. International Data Group.
[2] Adduci R, Blue D, Chiarello G, Chickering J, & Mavroyiannis D. (2011): Big Data: Big Opportunities to Create Business Value. Technical report, Information Intelligence Group, EMC Corporation.
[3] White T. (2012): Hadoop: The Definitive Guide. O'Reilly Media / Yahoo Press.
[4] Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, & Babu S. (2011): Starfish: A self-tuning system for Big Data analytics. 5th Biennial Conference on Innovative Data Systems Research (CIDR), pp. 261-272.
[5] Russom P. (2011): TDWI Best Practices Report: Big Data Analytics. The Data Warehousing Institute.
[6] Grover R, & Carey MJ. (2012): Extending map-reduce for efficient predicate-based sampling. IEEE 28th International Conference on Data Engineering (ICDE), pp. 486-497.
[7] Smola A, & Narayanamurthy S. (2010): An architecture for parallel topic models. Proceedings of the VLDB Endowment, No. 1-2, pp. 703-710.
[8] Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, & Hellerstein JM. (2012): Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, Vol. 5, No. 8, pp. 716-727.
[9] Laptev N, Zeng K, & Zaniolo C. (2012): Early accurate results for advanced analytics on MapReduce. Proceedings of the VLDB Endowment, Vol. 5, No. 10, pp. 1028-1039.
[10] Gupta R, Gupta H, & Mohania M. (2012): Cloud computing and Big Data analytics: What is new from databases perspective? In Big Data Analytics, Springer Berlin Heidelberg, pp. 42-61.
[11] Labrinidis A, & Jagadish HV. (2012): Challenges and opportunities with Big Data. Proceedings of the VLDB Endowment, Vol. 5, No. 12, pp. 2032-2033.
[12] Lloyd SP. (1982): Least squares quantization in PCM. Unpublished Bell Laboratories technical note, portions presented at the Institute of Mathematical Statistics Meeting, Atlantic City, NJ; also IEEE Transactions on Information Theory (Special Issue on Quantization), Vol. 28, pp. 129-137.
[13] Friedman HP, & Rubin J. (1967): On some invariant criteria for grouping data. Journal of the American Statistical Association, Vol. 62, pp. 1159-1178.
[14] MacQueen JB. (1967): Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, pp. 281-297.
[15] Jain AK, & Dubes RC. (1988): Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs.
[16] Gray RM, & Neuhoff DL. (1998): Quantization. IEEE Transactions on Information Theory, Vol. 44, No. 6.
[17] Agrawal R, & Srikant R. (1994): Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference, pp. 487-499.
[18] Tan P-N, Steinbach M, & Kumar V. (2007): Introduction to Data Mining. Pearson Addison-Wesley.