Data & Theory – Diving into Big Data Challenges
Abdallah Bari
Copyright © 2017 Abdallah Bari Math Coding & Analytics All rights reserved. ASIN: B0728M2GVG
Product or corporate names may be trademarks and/or registered trademarks of their respective companies, and they are used here for identification and explanation purposes only. Quotes are mentioned to support some of the arguments put forward, as well as for the purpose of commentary and explanation. Some of the quotes are repeated in their original languages to keep the original meaning intended by their respective authors. The sources of quotes, text and/or data are acknowledged and credited throughout the book and listed in the footnotes. However, if any copyrighted material, including data, has not been acknowledged or credited, we would be grateful if you could notify us so that we may rectify this in subsequent updates and editions. This book is intended to contribute to the debate on big data’s challenges, and in particular to address its epistemic challenges. The book also highlights the opportunities, and the importance of addressing the unexpected, sudden shortage in skills, requiring a data-literate environment to catch up with the rapid and continuous flow of data. For technical and other methodological big data challenges, including data analytics issues, expertise is needed, and other books are available to help with big data issues ranging from data integration (infrastructure) and data preparation (ETL) to data analysis. I believe that we need to address big data challenges, in both the public and private sectors, and to catch up with shortages in both skills and concepts, so as not to be caught in the middle of the rogue waves of big data. While technology is taking major leaps ahead, creating an unprecedented amount of data, there is a need for a more data-literate world with more long-term thinking. And if we pay attention, we might help the planet and its inhabitants.
© 2017 A. Bari Math Coding and Analytics Westmount, Quebec, Canada ASIN: B0728M2GVG
DATA & THEORY – DIVING INTO BIG DATA CHALLENGES
CONTENTS
Acknowledgments … i
1  Introduction – Big data challenges … 1
2  Data integration (RDBMs/Hadoop) … 23
3  Data preparation (ETL) … 36
4  Theory and Data – Interplay (a priori) … 51
5  Analytics (a posteriori) … 69
6  Image data (features’ extraction) … 92
7  Spatial data (a priori & a posteriori) … 110
8  Functional data (Sensor data) … 122
9  Data metrics – Information … 133
10  Patterns in Data – Machine … 140
Chapter 1
INTRODUCTION

“Experience alone, without theory, teaches management nothing about what to do to improve quality and competitive position, nor how to do it.” W. Edwards Deming1
Big data has grown tremendously rapidly in recent years, reaching an unprecedented position in which data of any kind attracts more attention than it did in the days of “small” or structured table data2,3. An estimated 2.5 quintillion bytes of data are added to big data each day, reported IBM4 in 2012. According to the 2011 McKinsey Global Institute (MGI) report, there are many ways in which big data is being used, across sectors, to create value, with noticeable changes in the economic landscape and public policy5,6. Many recent reports on digital trends converge with the 2011 MGI report, highlighting that big data has tremendous potential for growth and cognitive development, with the prospect of creating thousands if not hundreds of thousands of much-needed new jobs to be filled within the coming months and years
1 W. Edwards Deming (1982) Out of the Crisis. MIT, USA
2 Tom Boellstorff (2013) Making big data, in theory? First Monday 18(10)
3 Tom Davenport (2014)
4 IBM (2012) Bringing big data to the enterprise - ibm.com/software/data/bigdata/what-is-big-data.htm
5 McKinsey Global Institute (2011) Big data: The next frontier for innovation, competition, productivity.
6 Steven Miller and Debbie Hughes (2017) Burning Glass Technologies 2017
(Figure 1.1). There is an urgent need for a new breed of professionals with new skills across the whole data spectrum, from data integration and data analytics to machine learning and artificial intelligence. There is a need for a data-literate environment, to catch up with the data deluge and for people to thrive, according to the 2017 OECD Report7,8,9. As per Ridsdale et al. (2015), there has been a shift in the skills needed, from learning facts to acquiring more inquiry skills, described as “soft” skills for an increasingly data-driven society10. The shortage of skills, and the type of skills needed, caught both the business and academic worlds by surprise, and they are now in a rush to catch up, wrote Matthew Wall11. A survey conducted by the Economist Intelligence Unit, involving around 500 participants (25% from the medical devices sector, 25% from pharmaceuticals, 17% from IT and telecoms and 17% from travel and hospitality), attributed the challenges facing big data to the lack of a priori framing and problem definition as well as a posteriori interpretation12. To assess the need for data-savvy professionals, Burning Glass Technologies recently (2017) used a scoring algorithm to rank and sort job postings, while also cleansing them of duplicates, from a data “lake” of online job postings drawing on more than 40,000 sources, reported Steven Miller13. Based on this 2017 assessment, Burning Glass Technologies and IBM projected the number of positions for data and data analytics talent to grow

7 Steven Miller and Debbie Hughes (2017) Burning Glass Technologies 2017
8 IBM Analytics (2017) The Quant Crunch: Demand for data science skills is disrupting the job market. Read the study - https://www.ibm.com/analytics/us/en/technology/data-science/quant-crunch.html
9 Italian G7 Presidency 2017 - Presidency of the Council of Ministers http://www.g7italy.it/en/news/
10 Chantel Ridsdale et al. (2015) Strategies and Best Practices for Data Literacy Education Knowledge Synthesis Report. Dalhousie University, Canada.
11 Matthew Wall (2014) Big Data: Are you ready for blast-off? BBC News
12 Economist Intelligence Unit (EIU) (2015)
13 Steven Miller and Debbie Hughes (2017) Burning Glass Technologies 2017
further in the USA by 364,000 openings, to reach 2,720,000 data professionals by 202014 (Figure 1.1). The need for people with data and analytics skills is more urgent than ever, to catch up with technology, which is taking major leaps ahead. The emerging and next generation of technologies promises to be even more disruptive, wrote Miller and Hughes15. The 2011 MGI report forecast a 50 to 60 percent gap between the supply of and demand for people with deep analytical talent. In the USA alone, by 2018, the report estimated a shortage of 1.5 million people to help turn data into insights.

Figure 1.1: Demand for data-savvy professionals in the USA (number of postings in 2015 versus estimated postings for 2020). Source: Burning Glass Technologies and IBM (2017)
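As a back-of-the-envelope check on these projections (a sketch using only the figures quoted above; the 2015 baseline is derived by subtraction and is not taken directly from the report):

```python
# Figures quoted in the text from Burning Glass Technologies / IBM (2017)
projected_2020 = 2_720_000     # projected US data professionals by 2020
additional_openings = 364_000  # projected additional openings

# Implied 2015 baseline and growth rate (derived here, not from the report)
implied_baseline = projected_2020 - additional_openings
growth_pct = 100 * additional_openings / implied_baseline

print(f"Implied baseline: {implied_baseline:,}")  # 2,356,000
print(f"Implied growth:   {growth_pct:.1f}%")     # 15.4%
```

In other words, the quoted 364,000 additional openings correspond to roughly 15% growth over the implied existing workforce.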
According to the 2015 Canadian Information and Communications Technology Council (ICTC) report titled “Big data & the Intelligence Economy”, the supply of data scientists and related professionals is well below today's industry demand16.
14 IBM Analytics (2017) The Quant Crunch: Demand for data science skills is disrupting the job market. Read the study - https://www.ibm.com/analytics/us/en/technology/data-science/quant-crunch.html
15 Steven Miller and Debbie Hughes (2017) Burning Glass Technologies 2017
16 The Canadian Information and Communications Technology Council (ICTC 2015)
Under the section “Labor Market and Skills” of the 2015 ICTC report, the number of directly employed data analytics specialists was estimated at 33,600 and is expected to increase by 33% to reach 43,300 data analytics professionals by 2020.
Figure 1.2: Data Analytics Direct Employment in Canada – 33,600 specialists in 2016, rising to 43,300 by 2020. Source: Canadian Information and Communications Technology Council (2015)
Similar skills gaps were also reported in Europe, where the shortage of data-skilled manpower is having an impact on the labor market17,18. The crisis of the lack of new Big data skills, with demand greatly exceeding supply, transcends international boundaries and affects almost all sectors, including emerging technology sectors such as cognitive technologies. These emerging technologies will be driving new and high levels of analysis, efficiency and productivity in the near future. There are different types of cognitive technologies, and Tom Davenport listed five key types: 1) robotic process automation (RPA); 2) traditional machine learning; 3) deep learning; 4) natural language processing; and 5) rule-based expert systems19.
17 Eleanor Smith. Big data: labour market impacts. Posted in Blog, Social Sciences. https://www.big-data-europe.eu/big-data-labour-market-impacts/
18 European Data Forum – a recap from the BDE perspective (2015)
19 Tom Davenport (2017) Beyond ‘Doing Something Cognitive’. The Wall Street Journal, USA.
Big data’s contribution and prospects

Big data and big data analytics have contributed to a 6% increase in productivity, based on the 2016 research findings by the OECD, in a time of unfavorable economic climate20. The 2017 United Nations World Economic Situation and Prospects report attributed the unfavorable climate to the world economy being trapped in a self-perpetuating cycle of weak investment, dwindling world trade, flagging productivity growth and high debt. The report mentioned that this paradoxical situation of a slow world economy vis-à-vis a thriving digital economy is also hampering progress towards the newly launched Sustainable Development Goals (SDGs)21 to be achieved by 2030. It is a paradoxical situation in which big data’s contribution amounted to 5 to 6 percent of higher productivity and output growth, as reported in 2014 by the Massachusetts Institute of Technology (MIT) for companies that adopt data-driven decision making22. In 2015 the ICT sector made a substantial contribution to Canada's GDP, with software and computer services contributing up to 5.0%23. The 2011 McKinsey Global Institute report anticipated even more substantial and incremental changes yet to come24. In a time of slow global economy, the impact of big data, with the opportunity to optimize operations and processes and to create new and possibly different professions, is real and worthy of sustained and diligent attention, wrote Tom Boellstorff25. This year (2017) alone has been considered a year of change in terms of investments in big data, according to a Forbes
20 Organization for Economic Co-operation and Development (2016)
21 United Nations (2017) The World Economic Situation and Prospects report. https://www.un.org/development/desa/dpad/publication/world-economic-situation-and-prospects-2017/
22 The Committee for Economic Development 2014 at https://www.ced.org/blog/entry/big-datas-economic-impact
23 The Canadian Information and Communications Technology Council (ICTC 2015)
24 McKinsey Global Institute (2011) Big data: The next frontier for innovation, competition, and productivity
25 Tom Boellstorff (2013) Making big data, in theory. First Monday 18(10)
report26. The 2017 Forbes report suggested that the impetus behind these investments is driven by customer loyalty, within a vision that encompasses a wider and longer-term consideration for better, safer and healthier people’s lives in a rapidly changing environment27. Although the term “Big data” dates back to the 1990s, when it first appeared in academic publications, it only started to be used widely from 2008 onwards28. Big data has since been growing very rapidly and massively, embracing a variety of other data types and data streams, in less than a decade and across all fields, ranging from physics, astronomy, genomics, health, business, social science and economics to climate and environment29. Compared with small data, big data has increased in volume and velocity and expanded to include a variety of other types of data. With the advent of ICT we are confronted today with large datasets that were once confined to the domain of astronomy and high-energy physics30. However, it is worth mentioning that even though big data is large in volume, it can also be turned and transformed (chapter 3) into small data sets, and small datasets can be just as insightful and actionable. Nevertheless, dismissing data can be problematic: data previously dismissed as chaotic has been found to exhibit hidden patterns. The implications of Big Data are thus more subtle than they appear, as it has epistemic implications.
26 Forbes (2017) https://www.forbes.com/sites/ibm/2017/02/16/you-may-be-surprised-about-what-the-iot-connects-you-with/#7336c83340f3
27 Forbes (2017)
28 Tom Boellstorff (2013) Making big data, in theory. First Monday 18(10)
29 The Data Warehousing Institute (2011)
30 Vivien Marx (2013) Nature 498, 255
Figure 1.3: The Vs of big data: Volume, Velocity and Variety
Big data’s power does not undermine data size or type of data and, most importantly, does not obliterate the need for vision, or for people and their insight31. In light of big data, a rise is also expected for smaller datasets, to help foster the AI (artificial intelligence) innovation trend with results for decision making from less training data32 (Chapter 5). Big data has been increasingly sought in recent years to help discover hidden insights, create competitive differentiation and increase productivity. Big data has also been sought to solve complex, urgent, real-world problems across sectors, including the public sector, such as water failures caused by breaks in urban water systems, problems that could incur unbearable costs and may require years to repair. Solving such problems can be time- and resource-bound, as resources are limited and time may also be limited, as in the case of the city of Syracuse in New York (USA) in 201433. The city was in the midst of organizing the Syracuse University basketball game when a dozen water breaks occurred just a few days before the game. To counter the problem, the city used an algorithm developed by a team at the University of Chicago; the algorithm helped digest a large amount of data and locate the breaks along the city's 550 miles of urban water pipes, which helped in their timely repair. Repairs would otherwise have taken decades (generations), reported Debra Bruno34. As water becomes scarcer, with different sectors competing for its use, governments and organizations are also turning to big data to manage water resources by aggregating data across the entire water cycle, from “green water” (rain) to “blue water” (streams and rivers). They are also using sensors and monitoring systems that generate large amounts of near-real-time flows of data to help in managing water resources35.

31 Andrew McAfee and Erik Brynjolfsson. Spotlight on Big Data - Big Data: The Management Revolution; Dominic Barton and David Court (2012) Making Advanced Analytics Work for You. Harvard Business Review
32 Cynthya Peranandam, IBM https://www.forbes.com/sites/ibm/2017/03/02/four-catalysts-to-spark-the-next-wave-of-innovation-in-artificial-intelligence/#1ef9da2e172c
33 Debra Bruno (2017) http://www.politico.com/magazine/story/2017/04/20/syracuse-infrastructure-water-system-pipe-breaks-215054
34 Debra Bruno (2017) http://www.politico.com/magazine/story/2017/04/20/syracuse-infrastructure-water-system-pipe-breaks-215054
35 Sunil Jose (2017) https://www.linkedin.com/pulse/can-big-data-used-address-indias-growing-water-crisis-sunil-jose?
Figure 1.4: Vital sectors (food and urban water) competing for water use (United Nations)
Figure 1.5: Water cycle – from “green water” (rain) to “blue water” (streams and rivers).
Real-time satellite data is also used to track water productivity in agriculture, given that agriculture is the largest user of water in food production. This new initiative has recently been launched by research
and development organizations to tap satellite data to help optimize water use in irrigation. This is critical, as 7 percent or more of the population will experience a decrease of 20% in renewable water resources for each 1°C increase in temperature due to global warming36. Satellite imagery data has also been suggested for monitoring the evolving state of buildings and homes, such as gutters and water leaks, to help take preventative measures against the deterioration costs and water-leak repairs plaguing urban areas. Seventy percent of the world’s population will live in urban areas by 2050, and the digital transformation, along with the Internet of Things (IoT), will very likely define the architecture of cities and related infrastructural utilities, including water, in the future37,38,39. Satellite data will revolutionize everything from agriculture to insurance to finance, and may even help to save the planet, reported G. Burningham (2016)40. A very recent poll, conducted in April 2017 by Big Data Zone41, listed several real-world problems solved with big data, involving different sectors spanning retail, healthcare, media, telecommunications, finance, government, IT and fleet management. The April 2017 Big Data Zone survey involved executives from 22 companies who are working with Big Data or providing Big Data solutions to clients today.
Big Data for research and development

The prospects of big data to help provide even more comprehensive views of sustainable development have also led the United Nations to establish a new Independent Expert Advisory

36 fao.org/news/story/en/
37 Mark Wilson (2012) By 2050, 70 percent of the world’s population will live in urban areas
38 Ramona Albert (2017) 3 big ways sustainable design will shape future cities
39 UNICEF: An Urban World (2010) https://www.unicef.org/sowc2012/urbanmap/#
40 http://www.newsweek.com/2016/09/16/why-satellite-imaging-next-big-thing-496443.html
41 Tom Smith (2017) Executive Insights on the State of Big Data. Big Data Zone
Group to develop recommendations in light of the opportunities and possibilities available to people and institutions to delve into big data and to develop the much-needed long-term solutions to achieve the sustainable development goals, amid challenging global changes, by 203042. The United Nations developed 17 new Sustainable Development Goals, which were launched on 1 January 2016, to be achieved by the year 2030. The model of the embedded SDGs shown in Figure 1.6 was developed by the Stockholm Resilience Center, based in Sweden, with the aim of highlighting that economic as well as societal development sectors are embedded parts of the biosphere43.
Figure 1.6: This model with embedded SDGs was developed by the Stockholm Resilience Center (Azote Images for Stockholm Resilience Centre – 2017).
Sustainable Development Goals44
Goal 1. End poverty in all its forms everywhere
Goal 2. End hunger, achieve food security and improved nutrition and promote sustainable agriculture
42 United Nations (2016) Data Innovation: big data and new technologies
43 Stockholm Resilience Center. http://www.stockholmresilience.org/research/research-news/2016-06-14-how-food-connects-all-the-sdgs.html
44 United Nations https://sustainabledevelopment.un.org/topics/sustainabledevelopmentgoals
Goal 3. Ensure healthy lives and promote well-being for all at all ages
Goal 4. Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all
Goal 5. Achieve gender equality and empower all women and girls
Goal 6. Ensure availability and sustainable management of water and sanitation for all
Goal 7. Ensure access to affordable, reliable, sustainable and modern energy for all
Goal 8. Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all
Goal 9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation
Goal 10. Reduce inequality within and among countries
Goal 11. Make cities and human settlements inclusive, safe, resilient and sustainable
Goal 12. Ensure sustainable consumption and production patterns
Goal 13. Take urgent action to combat climate change and its impacts*
Goal 14. Conserve and sustainably use the oceans, seas and marine resources for sustainable development
Goal 15. Protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage forests, combat desertification, and halt and reverse land degradation and halt biodiversity loss
Goal 16. Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels
Goal 17. Strengthen the means of implementation and revitalize the global partnership for sustainable development
The Big Data phenomenon is a driving force for knowledge acquisition and value creation, fostering research and innovation
with the potential to transform most if not all sectors, according to the OECD (2016). The OECD report (2016) referred to data as the new research and development (R&D) for 21st century innovation systems, highlighting both big data and its analytics as fundamental inputs to innovation, akin to R&D. The report mentioned, based on available evidence, that companies using data-driven innovation (DDI) have raised productivity faster, by approximately 5-10%, compared to non-users (OECD 2016). The emergence of big data is allowing researchers to explore further the inferential power of algorithms. Some of these algorithms were previously used with success on modest-sized datasets, and by scaling them to big data they could show more promise in government, public health, public transportation, economic development and economic forecasting45. In research, big data provides the raw material needed to develop algorithms, as well as a platform for new algorithms to operate on and be tested46. Big data is enabling new technologies to develop further, such as Natural Language Generation (NLG)47, which was conceived 25 years ago. There is also a growing public interest in big data. Newspapers and media have sections on data visualization, such as the “information is beautiful” data visualization initiative by The Guardian news media, conceived and designed by David McCandless48. According to the recent OECD report titled “The Next Production Revolution: Implications for Governments and Business”, prepared within the framework of the 2017 Italian G7 Presidency, data will be central to 21st century production, along with investments in Research & Development. The report

45 Steve Lohr (2012) The Age of Big Data. The New York Times; Simons Institute for the Theory of Computing 2013
46 McKinsey Global Institute (2011) McKinsey & Company www.mckinsey.com/mgi
47 Robert Dale (2014) The Last Mile in Delivering Information from Big Data, An NLG Thought Leadership.
48 https://www.theguardian.com/data
prepared for the 2017 G7 Presidency by Italy also refers to the importance of our society being ‘data-literate’ and being involved in the new production and the new emerging technologies, to build long-term thinking49. There are also a number of other research and development issues to be addressed as a result of the availability of big data, such as the external validity of algorithms in a different, “unrelated” set or context of data. Researchers are developing mathematical models, such as the structural causal model, that help decide how information from one source should be combined with data from other sources. This approach, developed by Bareinboim and Pearl, aims to predict what the outcome of a medical treatment being tested would be if it were given to another population of people in the real world50. A global research effort involving several universities, research institutions and NGOs from a number of countries across all five continents used mathematical model predictions to help pinpoint climate-change-related genes in plants, combining big data with mathematics51 and omics52 (genomics and phenomics)53. The predictions were validated in the field, showing that the sought-after climate-related genes and traits had been found, and the results were considered breakthroughs in 201554. The identified genes, some of which had been searched for in vain, have now been incorporated into crop improvement to develop climate-proof crops55.

49 Italian G7 Presidency 2017 - Presidency of the Council of Ministers http://www.g7italy.it/en/news/
50 Matthew Chin (2017) Solving big data’s ‘fusion’ problem - UCLA, Purdue.
51 Mark Kinver (2014) Mathematics helps find food crops' climate-proof genes. BBC News, 15 August 2014 – http://www.bbc.com/news/science-environment-28789716
52 https://www.crcpress.com/Applied-Mathematics-and-Omics-to-Assess-Crop-Genetic-Resources-for-Climate/Bari-Damania-Mackay-Dayanandan/9781498730136
53 Stephen F. DeAngelis (2014) Big Data’s Big Role in Agriculture http://www.enterrasolutions.com/2014/09/big-datas-big-role-agriculture.html
54 Mathematical models speed search for plant genetic traits to adapt to climate change https://ccafs.cgiar.org/fr/research/annual-report/2014/mathematical-models-speed-search-for-plant-genetic-traits-to-adapt-to-climate-change
55 Mark Kinver (2014) Mathematics helps find food crops' climate-proof genes. BBC News, UK - http://www.bbc.com/news/science-environment-28789716
The big data phenomenon is thus paving the way to explore new frontiers of research, such as brain health and artificial intelligence. The University of Montreal and partners launched a new institute to advance artificial intelligence, known as IVADO, aiming to shed further light on brain research and advance cognitive analytics56.
Figure 1.7: Big data (climate data) combined with mathematics to search for traits to help crops cope with changing climate conditions
Big Data challenges - Can big data speak for itself?

However, as Big data grows massively and rapidly, with the prospect of becoming the new “R&D” for 21st century innovation, it is also creating unprecedented new challenges and controversies, such as whether big data can render science obsolete. That question was raised back in 2008 by Chris Anderson, then editor in chief of Wired, who wrote that “the data deluge makes the scientific method obsolete”57,58.
56 Karen Seidman (2016), Montreal Gazette of 07/09/
57 Vermesan et al. (2014)
58 Chris Anderson (2008) Wired magazine
As much as Big data is bringing new opportunities, it is also bringing new challenges59. To cope with the challenges and/or the emerging opportunities, most sectors are also looking for elaborate frameworks to analyze data, wrote Bernard Marr (2015). There are concerns that some of these challenges and controversies may slow progress in innovation and delay the development of envisioned or potential future big data applications. Other research has also begun to identify challenges that are not only methodological but also epistemic, related for instance to the establishment of theoretical frameworks to scale inferences and machine learning (ML) algorithms60. Epistemology is the study of knowledge, concerned with ensuring that results are not true merely because of luck61.

"The trouble is, we don't have a unified, conceptual framework for addressing questions of data complexity... Big data without a “big theory” to go with it loses much of its usefulness, potentially generating new unintended consequences." Geoffrey West62

Is reducing big data to tiny data sets using Hadoop MapReduce a reasonable procedure? questioned Feldman and Schmidt. MapReduce has its merits: by moving computation to where the data is located, across a myriad of servers, instead of moving the data to our computers, it allows us to access large datasets across large networks of servers. Are there bandwidth cost implications when moving data around, from a centralized server (database or file management system) or a distributed system, to memory or to hard disks? wrote Vincent Granville63. In terms of analytics, the size of the data can also be a source of

59 Jianqing Fan et al. (2013) Challenges of Big Data analysis.
60 M.D. Assunção, R.N. Calheiros, S. Bianchi, M.A.S. Netto, and R. Buyya (2015) Big Data computing and clouds: Trends and future directions. J. Parallel Distrib. Comput. 79–80:3–15
61 Stanford Encyclopedia of Philosophy (2005) Epistemology
62 Geoffrey West (2013)
63 Vincent Granville (2013) A synthetic variance designed for Hadoop and big data
spurious correlations and a confounding factor64. An increase in big data dimensionality may also propagate the noise accumulated in the data, creating unwanted correlations (and variation) of data scaling with noise, which need to be considered in the algorithmic processes for leveraging insights from big data65. Sugihara66 noticed that some correlations may only be present intermittently; he called them “mirage correlations” when he found that ocean temperatures and fish numbers went out of sync in the mid 1970s. The tight correlation that scientists thought they had found earlier between temperatures and fish numbers seemed illusory, according to Sugihara, as the salmon population appeared to fluctuate randomly. Sugihara also remarked that the equations commonly used in modelling may not yield the same success across disciplines as they have done in the physical sciences67. Thus, despite the relevance of data in light of big data, there is a need to further elaborate on the mechanisms through which data are collected, and in particular on the ultimate purpose and rationale behind continuous data gathering, observation and analysis68. David Bollier (2010) cited in the report “Promise and Peril of Big data” several points related to theory formation, pointing out that “In practice, the theory and the data reinforce each other… The use of data for correlation allows one to test theories and refines them”. In addition to these new Big data epistemic challenges, where we have more data than concepts, there is also a shortage in the new data skills needed to help address big data's different issues and new

64 J.M. Mooij et al. (2016) Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research 17(32):1-102
65 J. Aluya & O. Garraway (2014) The Influences of Big Data Analytics, Is Big Data a Disruptive Technology?
66 Sugihara G, et al. (2012) Detecting causality in complex ecosystems. Science 338(6106): 496–500
67 Donald L. DeAngelis and Simeon Yurek (2015) Equation-free modeling unravels the behavior of complex ecological systems, PNAS 112(13): 3856–3857
68 Degli Esposti, S. (2014) When big data meets dataveillance: The hidden side of analytics. Surveillance & Society 12(2): 209-225
DATA & THEORY – DIVING INTO BIG DATA CHALLENGES
challenges. A blend of mathematical and analytical skills can help greatly in building these new skills and preparing for the labor market of big data analytics, according to the 2015 report by the Information and Communications Technology Council of Canada. “The growth over the past decades in the raw processing speed of computer processing units has been staggering. However, as impressive as this growth in computing power has been, our modern scientific computing capability would not have developed without an equally important investment in the underlying enabling mathematics”69. Similar results were also reported in the USA vis-à-vis the shortage in big data skills. The McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent and math skills, a shortage estimated to reach 1.5 million people by 2018. Overall, big data’s prospect of turning complex, imperfect and often unstructured data into actionable insights requires people’s creativity across our society, beyond a user-centered approach70. “Big data’s power does not erase the need for vision or human insight.” Andrew McAfee and Erik Brynjolfsson71
The lack of skilled data professionals is hindering companies from realizing the expected benefits of big data, according to a poll conducted by Big Data Zone72. Executives of big data companies interviewed in the poll describe the lack of skilled people as a huge talent gap currently facing big data. Gartner, Inc. insisted, in 2015, on re-focusing on mindsets and culture while acquiring tools and skills. Gartner also predicts that 60 percent of the big data projects that do not consider these changes will be limited to just piloting and
69 Independent Panel from the Applied Mathematics Research Community (2008) Applied Mathematics.
70 United Nations – Global Pulse (2012).
71 Andrew McAfee and Erik Brynjolfsson ( ) Spotlight on Big Data – Big Data: The Management Revolution; Dominic Barton and David Court (2002) Making Advanced Analytics Work for You. Harvard Business Review.
72 Tom Smith (2017) Executive Insights on the State of Big Data. Big Data Zone.
experimenting with big data, without real benefits to be foreseen during 201773. A shift of mindset is needed, from systems of record towards prediction-as-a-platform (PAAP)74,75.
Data and theory (interplay)
To address the rise in big data issues and challenges, and in particular epistemic challenges such as whether data alone is sufficient to draw inferences, the Aspen Institute Roundtable on Information (AIRI) convened a roundtable meeting in 2009. The AIRI roundtable meeting was attended by 25 people representing leaders, entrepreneurs and academics called upon to examine the implications of these emerging big data issues (inferential issues) for science as well as for people. The 2009 Aspen Institute report, written by David Bollier, referred to Chris Anderson’s provocative remark that “the data deluge makes the scientific method obsolete”. Chris Anderson, the editor-in-chief of Wired magazine, who attended the Aspen Institute meeting, argued that physics and genetics have drifted into arid, speculative theorizing because of the inadequacy of testable models, and that with big data today one can say that “correlation is enough”, without the need for models or any theoretical or scientific backing in support of the correlation findings. Chris Anderson’s provocative remark is above all an epistemological challenge, arising from the changes that big data brings to theory formation. Hal Varian, chief economist at Google, who was also present at the 2009 AIRI roundtable meeting with Chris Anderson, highlighted that theory’s role is as crucial as data’s: a theory allows one to extrapolate outside the observed domain. He argued, to quote:
73 Gartner, Inc. (2015) Analysts to Explore Advanced Analytics at the Gartner Business Intelligence & Analytics Summit 2015, October 14-15, Munich.
74 James Canton (2015) “The Predictive Enterprise: Five Trends Shaping the Future of Business.” Huffington Post, The Blog.
75 Stephen F. DeAngelis (2017) Digital Enterprises: Moving from Systems of Record to Systems of Prediction. Enterra Solutions.
“When we have a theory, we don’t want to test it by just looking at the data that went into it. We want to make some new prediction that is implied by the theory”. David Bollier (2010)
In another article, titled “Is big data enough? A reflection on the changing role of mathematics in applications”, Napoletani et al.76 voiced concerns about relying on data alone as a methodological approach to problem solving, to quote: “The methodological danger is that the flood of data generated by our innumerable measuring devices may convince us that data is enough, that there is nothing beyond the microarray paradigm, and that opaque, enormous, data-driven models are the privileged way to approach phenomena”77. Napoletani et al. (2014)
McKinney and Niese, who adopted Scriven & Paul’s (1987) critical thinking approach to big data, involving conceptualizing, analyzing, synthesizing, and evaluating information gathered from, or generated by, observation or experience as a guide to action, argued that critical thinking skills are just as important in big data analytics, if not more so78.
“in many cases less is actually more, if data holders can find a way to know what they need to know or data points they need to have” David Bollier (2010)
Big data has also led to the elaboration of concepts to address issues raised by big data outliers within the MapReduce computation framework. This is a case where a theory is developed starting from the data, rather than being elaborated for the data, owing to the changes that came with big data. The variance elaborated in Hadoop has
76 Napoletani, D., Panza, M. and Struppa, D. (2014).
77 Francisca Cabrera (2015).
78 Earl H. McKinney Jr. and Bethany D. Niese (2016) Big Data Critical Thinking Skills for Analysts: Learning to Ask the Right Questions. Twenty-second Americas Conference on Information Systems, San Diego, 2016.
however, a formula that is notoriously unstable, as noticed by Vincent Granville79. The book focuses on this very interplay between data and theory, while also referring to real problems solved using big data and various data analytical tools. The interplay between data and theory is also of relevance to the information-theoretic approach, as well as to cognitive analytics, where the a priori and the a posteriori can be combined to tackle the challenges of human action recognition80 and of big data analytics.
Book chapters’ overview
To address the complexity of data, to help to gain insights and, in turn, to turn insights into actions, this book intends to bring the data to the level of theory, to ease and speed up the analysis. The chapters also contain illustrations along with several applications. The book has three main sections or parts. The first part is on data and theory (the a priori) and on how they have evolved over time, from Euclid’s work to Bachelard’s and Turing’s, with reference to Kant’s and Kuhn’s contemporary work. The second section addresses analytical tools used in the analysis of data, including structured and unstructured data, spanning table data, image data, spatial data and sensor data (functional data), within the interplay of a priori and a posteriori as well as the data-and-theory perspective. The third part is about metrics and predictions as a posteriori evaluations or simulations; it also includes chapters on information and patterns in light of big data, with a refocus on the mathematical theoretical framework, stressing the dependency between the practical and the “cognitive”. Mathematical equations are considered by Ian Stewart, in his book on the pursuit of the unknown, to be the lifeblood, the very essence, of mathematics, science and technology. Some equations appear here in this book
79 Vincent Granville (2013) A synthetic variance designed for Hadoop and big data.
80 Cumin and Lefebvre (2016).
written in brief to support the arguments put forward. “Equations model deep patterns in the outside world. By learning to value equations, and to read the stories they tell us (although these stories could also be told in other ways), we can uncover vital features of the world around us…Throughout history, equations have been pulling the strings of society.” Ian Stewart81
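Regarding the variance instability in Hadoop noted above (footnote 79): the textbook one-pass formula E[x^2] - (E[x])^2 suffers catastrophic cancellation on large, offset values, whereas merging per-partition summaries (count, mean, sum of squared deviations) is numerically stable and fits naturally into a MapReduce combine step. A minimal sketch, ours rather than the book’s (merge_stats follows the well-known Chan et al. pairwise update):

```python
def naive_variance(xs):
    """Textbook one-pass formula E[x^2] - (E[x])^2 (numerically unstable)."""
    n = len(xs)
    s, sq = sum(xs), sum(x * x for x in xs)
    return sq / n - (s / n) ** 2

def merge_stats(a, b):
    """Combine (count, mean, M2) summaries from two data partitions,
    where M2 is the sum of squared deviations from the partition mean.
    Suitable as a MapReduce combiner/reducer step."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)

def stable_variance(partitions):
    """Population variance of all values, computed partition by partition."""
    total = (0, 0.0, 0.0)
    for part in partitions:
        n = len(part)
        mean = sum(part) / n
        m2 = sum((x - mean) ** 2 for x in part)
        total = merge_stats(total, (n, mean, m2))
    return total[2] / total[0]

# Values with a huge common offset: the true variance equals that of
# [0.0, 1.0, 2.0], i.e. 2/3.
data = [1e9, 1e9 + 1, 1e9 + 2]
print(naive_variance(data))                   # wildly off: cancellation error
print(stable_variance([data[:2], data[2:]]))  # close to 2/3
```

Because merge_stats is associative over partitions, each mapper can summarize its own data block and the reducer only merges small (count, mean, M2) triples, which is exactly the shape a Hadoop-style computation wants.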
Chapter 1 – This is an introduction to big data’s opportunities and also its challenges. The chapter highlights the prospects of big data for improving productivity across sectors, while also addressing the challenges, in particular the epistemic challenges raised as a result of the advent of big data. Several cases of big data’s contribution to the economy and to solving real and complex problems are mentioned. The chapter re-focuses on the epistemic challenges facing big data that may hinder innovation, and thus suggests considering the importance of the interplay between data and theory. (The word “data” may be treated as singular, or as the plural of “datum”, depending on the context.)
Chapter 2 – Data is growing massively and expanding to include many types, ranging from structured to unstructured data, from textual to image, geospatial and sensor data. All these data can be aggregated together with other sources, such as secondary data, to carry out comprehensive analytics. Data integration can pull together different sources of data using different approaches, such as the Hadoop “ecosystem” approach involving non-relational database systems. The Hadoop approach takes the analytics to the data through its MapReduce algorithm, which helps to process large amounts of data rapidly, thus allowing scalability.
81 Ian Stewart (2012).
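Chapter 2’s MapReduce idea, taking the analytics to the data and then aggregating by key, can be sketched in miniature. This in-memory toy is our own illustration (real Hadoop distributes the map tasks to the nodes holding the data blocks); it counts words across input splits:

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two "splits" standing in for data blocks on two nodes.
splits = ["big data needs theory", "theory needs data"]
mapped = [pair for chunk in splits for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"])    # 2
print(counts["theory"])  # 2
```

Scalability comes from the fact that map_phase runs independently on each split and reduce_phase only sees grouped intermediate pairs, so both phases parallelize across machines.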
Chapter 3 – This chapter addresses data preparation, whether the data are quantitative or qualitative, continuous or discrete, or even fuzzy, somewhere in between. In addition to integration, any raw data may require laborious preparation and pre-processing prior to analysis, a process estimated to take up to 80% of the time needed to carry out any in-depth data analysis. This data preparation process is as decisive as the data analysis itself, to avoid confounding-factor effects in the analysis.
Chapter 4 – This chapter focuses on the interplay between empirical data and theoretical (epistemic) frameworks in support of practice (realism). Empirical data may not be pre-set (implicit) but rather the result (explicit) of an interchange between alternating practical provisions and theoretical presuppositions, between realism on the one hand and rationalism on the other. Rationalism, as defined epistemologically by Charlie Huenemann, has among others the following characteristics82:
A priori knowledge (the a priori as a route to knowing what is true)
The cognitive capability to distinguish between the intellect and the imagination
The reliability of reason over empirical justification
This chapter relates to big data’s influence on analytics, as datasets originally dismissed as chaotic revealed the presence of patterns, and new models were developed to capture fluctuating data correlations due to changing climate conditions83. The presence of confounding factors in data can be problematic, as with the artifacts observed in geometrical data known as the Müller-Lyer illusion, where line segments ending with outward fins appear longer than line segments ending with inward fins, despite having the same length84.
82 Edwin Mares ( ) A Priori, p. 34.
83 Baret (2014).
84 Artstein 2014, p. 65; Ninio, J. (2014) Geometrical illusions are not always where you think they are: a review of some classical and less classical illusions, and ways to describe them. Frontiers in Human Neuroscience, 8, 856. http://doi.org/10.3389/fnhum.2014.00856
Asking questions or stating problems can generate information that can in turn be used to develop further rules85. Formalizing such procedures a priori in data analytics, including the use of the (expected) information resulting from questions under an information-theoretic approach86, can help in the analysis (chapter 5 and chapter 9).
Chapter 5 – As with data gathering, data analytics has played an important role in our society and our economy. Analyzing data, however, remains challenging and labor intensive (IBM)87. It is all the more challenging as data expands to include other types, broadening the analytics beyond descriptive and predictive analytics to more prescriptive analytics, with fact-based knowledge tools for finding the needle in the haystack and providing the best set of analytics to make the most optimized decisions. While this shows the overall effectiveness of big data and big data analytics, the terrain to traverse is vast when counting on data alone to leverage big data insights successfully. Data analytics also requires having people on board, people with the new skills needed to formulate pertinent questions, some of which may not even have been asked yet88. Sophisticated analytics are emerging to substantially improve decision making, minimize risks, and unearth valuable insights that would otherwise remain hidden. Big data also provides the new and different raw material that will help to develop new algorithms, or on which those algorithms can operate and be validated89.
85 Berthold and Hand, p. 212.
86 Y.Y. Yao ( ) Information-Theoretic Measures for Knowledge Discovery and Data Mining.
87 Rob Thomas, IBM (2017) https://www.forbes.com/sites/ibm/2017/02/15/machine-learning-ushers-in-a-world-of-continuous-intelligence/#5cf8137a4008
88 Ernst & Young Global Limited report of 2014; EMC Corporation (2011).
89 McKinsey Global Institute (2011) – www.mckinsey.com/mgi
Chapter 6 – This chapter expands the analytics to image analytics
considered to bear tremendous potential for solving economic and industry problems, as bandwidth is no longer a concern90. Imaging techniques allow the generation of massive datasets, including DNA microarray images. Analytical tools are helping to associate genes with phenotypes, compute molecular weights and compare DNA patterns. Automation decreases the effort and time spent processing data from gel images by providing an automatic, step-by-step gel image analysis system with friendly graphical user interfaces.
Chapter 7 – This chapter refers to spatial data as a predecessor of “big data”, involving large datasets before the internet age. Spatial data dealt with large datasets and patterns early on, raising the issue of developing and elaborating theories to explain the underlying relational patterns found through large-scale measurement of social variables. The elaboration of such theories was based on natural science, in particular social physics. Spatial data analytics then included machine-based computing and mathematical formulas, which Barnes and Wilson consider the grounds on which big data later staked its claim to knowledge91.
Chapter 8 – This chapter, in contrast to chapter 7, refers to functional data, for which theories and pertinent concepts were formulated earlier. Theories related to functional data were elaborated for the data instead of being developed entirely from the data. Functional data consist of observations of a functional nature, such as curves and surfaces, that capture both quantity and quality, such as the reliability, performance and nature of many processes. They are becoming the leading edge of streaming data today as a result of myriads of devices and sensors, such as thermometers sensing temperature. Big data is being generated by many different sources, including digital imagers (telescopes, video cameras, MRI machines), chemical and biological sensors (microarrays,
90 Fritz Venter and Andrew Stein (2012) at http://analytics-magazine.org/images-a-videos-really-big-data/
91 Trevor J. Barnes and Matthew W. Wilson (2014).
environmental monitors), and even the millions of individuals and organizations generating web pages92. Although their features are very different from those of traditional “vector” or table data, their analytics were well established earlier. Functional data are gaining momentum as more and more devices yield real observations.
Chapter 9 – This chapter is an extension of chapter 5, in which metrics are used to measure model predictive performance. Some metrics can help to see whether data exhibit patterns, such as the AUC, Kappa, fractals, the Hurst exponent or the variogram. The chapter refers to quantified and measured information based on Shannon’s mathematical definition of information, under which robust features should reveal values beyond what is expected, as in the tails of the probability distribution function93. Information generated from data is a “socle” (a base) for product development and an important enabler of competitiveness and progress94. Knowledge accumulates best through theory-forming and rigorous tests of those theories. Either implicitly or explicitly, some (a priori) information or process takes place before a theory can be formed; on the basis of that information, data is turned into information, which is in turn turned into knowledge, in a circular rather than a linear path. There is a connection between the a priori and the a posteriori within the context of rationality and cognition95.
Chapter 10 – This chapter is on patterns, as the presence of patterns in data can be an indication of the possibility of prediction. Part of the lure of big data is that it may reveal entirely new, unexpected patterns. Scientists and researchers have therefore worked to develop statistical methods that will uncover these novel
92 Randal E. Bryant (Carnegie Mellon University), Randy H. Katz (University of California, Berkeley) and Edward D. Lazowska (University of Washington) (2008) Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society. Version 8.
93 Battail (1997); Feldman & Singh ( ).
94 Faulkner, W. (1994) Conceptualizing Knowledge Used in Innovation: A Second Look at the Science-Technology Distinction and Industrial Innovation. Science, Technology, & Human Values, 19(4), 425-458.
95 Tuomas E. Tahko (2011).
relationships. While some patterns come from observations, other patterns grew out of a hypothesis-testing tradition rather than out of the extensive pattern-recognition literature.
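The predictive-performance metrics mentioned for chapter 9 can be illustrated with the AUC, which measures whether scores exhibit a pattern that separates two classes. A minimal sketch on hypothetical data (the rank-based Mann-Whitney formulation used here is standard, not code from the book):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a randomly chosen positive example outscores a
    randomly chosen negative one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Scores that separate the classes perfectly give AUC = 1.0; scores that
# carry no class-separating pattern hover around 0.5.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.9, 0.2, 0.8, 0.1], [1, 0, 0, 1]))  # 0.5
```

An AUC near 0.5 is thus a quantitative way of saying the data exhibit no usable pattern for this prediction task, tying the metric back to the chapter’s theme that patterns are what make prediction possible.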