Spatial Data Mining in the Context of Big Data

3 downloads 33553 Views 1MB Size Report
Keywords- Big data; Spatial data mining; Data intelligence. I. SPATIALLY DISTRIBUTED ... technology applications and analysis spatial data, such as computers ...
2013 19th IEEE International Conference on Parallel and Distributed Systems

Spatial Data Mining in the Context of Big Data Shuliang WANG Hanning YUAN School of Software Beijing Institute of Technology 5 South Zhongguancun Street, Haidian District, Beijing, 100081, P. R. China Email: [email protected] [email protected]

For the tools of collecting spatial data, it can be macro and micro sensors or devices such as radar, infrared, optical, satellite multispectral scanner, digital camera, imaging spectrometer, electronic total station, telescopes, television camera, electron microscopy imaging and CT imaging. It can be spatial data acquisition means such as conventional field measurements, census, land resources survey, map scanning, map digitizing and statistics charts. It also can be a process of technology applications and analysis spatial data, such as computers, networks, GPS, RS and GIS. The concrete content includes data sources, the original observations (raw data), as well as steps, format conversion, date, time, location, personnel, environmental, transmission and history of data collecting, editing, storage and using [21-30].

Abstract—In this paper, spatial data mining is presented in the context of big data. Spatial data plays a primary role in big data, attracting human interests. Spatial data mining confronts much difficulty when attracting the value hidden in spatial big data. The techniques to discover knowledge from spatial big data may help data to become intelligent. Keywords- Big data; Spatial data mining; Data intelligence

I.

SPATIALLY DISTRIBUTED BIG DATA

Big data is complex data set that has the following main characteristics: Volume, Variety, Velocity and Veracity [1] [2] [3] [4] . These make it difficult to use the existing tools to manage and manipulate [5]. In these data, spatial data specifically accounts for the vast majority. About 80% of data is associated with the spatial position [6]. Spatial data is the basis of data and source of wisdom for people to understand the real-world through the information world [7-10]. Big Data is closely related to applications [11], and spatial data mining is its principal application [1][9-15].

The remote sensing Earth observation has become an indispensable part of social, [24-26]. Spatial sensors, satellite launch, satellite control and other hardware technology has made a major breakthrough. In the future, space-based information systems and Earth Observing System (EOS) will try to establish the ability to get a variety of spatial data realtime and around-the-clock. Moreover it will gradually forms the satellite (group) of EOS: set of high spatial, high spectral, high temporal resolution, wide ground covered and other performance, along with positioning, communications and observation functions (Fig. 1, Fig. 2). The rapid development of sensors also makes the number of bands that describing the space object attributes increase from a few to dozens or even hundreds. Remote sensing Earth observation technologies are supporting a multi-level, multi-angled, all-round and aroundthe-clock global stereoscopic earth observation network which combines with high, medium and low orbits and cooperates with large, medium and small satellite, and coarse, fine resolution complement to each other. The range of ground resolution magnitude of the sensor is from kilometers to centimeters. The range of wavelength is from ultraviolet to long wave. The interval is from ten days once to three times a day. The range of probing depth is from a few meters to ten thousand meters. Quick Bird, IRS, IKONOS have provided high resolution remotely-sensed data through new remote sensing earth observation technology marked by high spatial, high spectral and high dynamic. Earth observation system with multi-sensor, multi-purpose, multi-resolution and multifrequency can provide Moderate-resolution Imaging Spectroradiometer (MODIS) data, ASTER infrared-imagecontained data, CERES model data coming from clouds measuring and 4-D simulating, MOPIT data and MISR data. New satellite sensors with high-resolution and high dynamic

Big data centers data, and mines knowledge in the entire data, breaking the sampling randomness of the sample[11], and demonstrating on big data center and mobile terminal. These information technologies serve for the understanding and transforming of the real world. Big data is attracting much attention, for example, the United Nations, the United States, the People’s Republic of China, Science, Nature, the Wall Street Journal, Gartner, McKinsey, Microsoft, IBM, Oracle, Google, and Baidu[1-5][16-20].

A. Spatial data is the basis for big data The content of spatial data describes spatial objects’ specific geographic orientation and spatial distribution in the real world. Spatial data also includes space entities attributes, the number, location, and their mutual relations, and covers the entire Macro-Meso-Micro level. The data can be the value of point height, road length, polygon area, building volume and the pixel gray. It can be the string of geographical name and annotation. It can also be graphics, images, other multimedia, spatial relationships and other topologies [7] [8]. Compared with the General data, spatial data is with characteristics of spatial, temporal, multi-dimensional, Volume, complex spatial relations and so on [1-5].

1521-9097/13 $31.00 © 2013 IEEE DOI 10.1109/.87

486

are not only high dynamic-band numbeer, high spectral resolution, high data rate, short period, buut also particularly large amount of data which can reach gigabbit level or above. For example, the space remote sensing daata obtained from

Fiigure 1.

EOS-AM1 and PM1 in one day can reach terabyte level. Landsat can get satellite image data global coverage every two weeks, and there has been deccades of history data around the world accumulated[1-3][8-9].

Satellite-based earth observation network[6]

Figure 2. Satellite data and services[6]

of urban planning database, enggineering geological information database, current land use database, strategic planning information database, regulatoory detailed planning database, municipal red line database, arcchitectural columns and building line database, cadaster databasees and basic farmland protection planning database. Furthermorre, in addition to those already

The speed of the construction of spatial data infrastructure is increasing, as well as the accumulation off space-based data [9] . Spatial data infrastructure has already contained a large number of cities in the electronic map databbase, road network

487

of software running to test the practicality of the hybrid computer system. NASA Planetary Data System is a planetary mission data archives, and has become the basic resources for scientists all around the world, one of which is planetary discipline online catalog system, via which all products can be peer-reviewed, improved records and do access or queries. Multitasking archive of NASA Space Telescope Science Institute is a part of the distributed space science data services which focuses mainly on the spectrum and other sciencerelated data sets of optical, ultraviolet and infrared. It also provides various astronomical data archives that supporting a variety of tools to access. NASA Earth System Grid Federation is a public archives and cooperates with the U.S. Department of Energy to provide observations and models for the Federal Government. Word Wind, released by NASA is an open source geographic science software, which shows the three dimensional Earth model provided by NASA, USGS, and other WMS vendor. Through this software users can browse the historical image data, monitor disaster events by Modis data and dynamic monitor of global temperatures.[9][31]

stored and accumulated data, new spatial data are being generated and collected.

B. Spatial data is the focus of the research and development big data The research and development of big data involve national security, life and health, climate variation, geological survey, disaster prevention and reduction and Smart Planet, which are all associated with spatial data [1][4][13]. For example, in the United States, government established National Information Infrastructure in 1993, released National Broadband Plan in 2010, and invested 200 million dollars to start Big Data Research and Development Initiative. In this plan, US Geological Survey and National Aeronautics & Space Administration are most closely associated to spatial data [2] [3]. The USGS John Wesley Powell Center for Analysis and Synthesis provides scientists with the place and time to indepth analysis, the most advanced computing ability and the collaboration tools to perceive big data sets, which eventually promotes the innovation thinking of the Earth System Science (ESS). In this center, scientists cooperate to synthesize the comprehensive and long-term data, and further convert big data and the big ideas of Earth science theory to scientific discoveries with the aim to improve the understanding of the ESS and response capabilities. For example, species respond to climate change, earthquake recurrence rate and the next generation of ecological indicators.

In addition, special mitigation satellites, remote sensing satellites, communication and navigation satellites have been widely used in a variety of disaster management, such as earthquakes, tsunamis, typhoons (hurricanes), floods, droughts, geological disasters and fire. Currently, Disaster Monitoring Constellation is a system that consists of satellites from various countries through international cooperation. This system has these follow characteristics, high time resolution, large monitor range and fast response. It can be widely used disaster reduction process and disaster mapmaking in floods, hurricanes and fires.

NASA uses advanced information systems to mature big data capacity in order to support future Earth observation missions. Firstly, the Earth information can be identified by NASA Climate Center. Secondly, it can further reduce the risk, cost, size and development time of space-based and groundbased information system in the Earth Sciences Department. Finally, it can also improve the accessibility and practicality of scientific data. NASA Earth science data and information system project has been active for more than 15 years. This project aimed at archiving, processing and publishing Earth satellite data, air and ground truth data, in order to ensure that scientists and the public can access data from the Earth into space, and enhance the ability to respond to climate and environmental change. NASA established the global Earth observation system is an attempt of international cooperation to share and integrate Earth observation data. It teamed up with the U.S. Environmental Protection Agency, the U.S. National Oceanic and Atmospheric Administration and many other agencies and countries. It has integrated satellite and groundbased monitoring and modeling systems to assess environmental conditions and predict events (such as forest fires, population growth and other natural or human development issues). In the foreseeable future, researchers will integrate the complex air quality information in order to have a better understanding and find a better method to deal with the air quality impact on the environment and human health. NASA sets a space action agreement with Cray Company, which allows one or more projects cooperate around the development and application of low-latency big data system. NASA also uses the highly integrated non-SQL databases to transfer data, in order to accelerate the modeling and analysis

II.

SPATIAL DATA DISASTER

Spatial data is closely related to human daily life, permeated all walks of life. The number, size and complexity are all in sharp increasing. A large amount of data has been stored in the spatial database and warehouse in types of text, graphics, images and multimedia [9-10][15]. The research from International Data Corporation [16] has shown that, as of 2003 humans have created a total of 5EB data, while in the year of 2011, the amount of data that had been copied and produced is exceeded 1.8ZB. It is expected that by 2020 global data usage will reach 35.2ZB, which needs 37.6 billion hard drives of 1TB capacity to store. On the one hand these data broadens the scope of available spatial data available for human to gain wisdom. On the other hand the value of a single unit of the data is rapidly declining. Human is submerged by the data ocean but thirsty for knowledge[32-35]. Big data is voluminous and it grows quickly, but it has very low density in value, which means there is a lot of junk data [31]. Spatial data from the real world is polluted [33]. There are a variety of causes lead to spatial data incomplete. Inaccurate spatial data means incorrect value compared with the reality of the physical attributes. Duplicate records means one real-world recorded is stored in a multi-source spatial data tautologically, or the real-world information is duplicated stored in multiple systems. Inconsistent spatial data refers to the case that space

488

data context-related conflicts are caused by different systems and applications on type, format, granularity, synonyms, encoding, and so on.

means collecting spatial coordinates and attributes of the surface point-by-point through total station, GPS receivers and other conventional ground-based measurement. The plane acquisition means collecting images of large areas through air and space with remote sensing methods, to extract the geometric and physical properties. The mobile acquisition means integrating the use of space positioning system (currently mainly refers to the Global Positioning System), Remote Sensing (RS) and Geographic Information Systems (GIS) to collect, store, manage, update, analyze and apply spatial data in EOS.

The production, transmission, replication and accumulation of data have gone far beyond people's capacity for analyzing, understanding and implementing [34]. Over time, all walks of life are submerged by contaminated data garbage, and then it could lead the big data into "garbage in, garbage out", and the “big data” becomes the useless “big garbage”. Now, useful data is buried, and implied value is blanked in big data[35].

III.

Big data storage technology is the basis for data mining. It is designed to meet the growing need for data storage, which aims to provide scalability, high reliability, excellent performance data storage, access, and management solution, such as distributed data storage, multiple levels caching, load balancing, fault-tolerant mechanisms. Conventional methods are not adequate for these missions. It needs to establish a large platform for data through software, to provide places to store and interface to access. In February 2012, researchers at the University of York in UK developed a heat rather than the magnetic field of computer hard disk data storage technology, that reduces energy consumption while reach the storage speed of thousands of GB per second [32]. In October 2012, the Fujifilm and IBM developed a barium ferrite particles coated tape, which can store 35TB data in the size of 10cm × 2cm [33]. In December, the Massachusetts Institute of Technology synthesized the third magnetic state herbertsmithite pure crystal, which may have a tremendous impact on magnetic storage technology [34].

THE VALUE OF SPATIAL DATA

Adam Smith, a well-known Scottish economist, believed that “useful things” can be treated as capital. Data is valuable, and its value is in increasing via adaptive self-learning. Spatial big data is collected from numerous and interconnected sources. Real usefulness is its maximum value. The generally accepted rule of big data is “decision on data”. The first prerequisite is to keep data always useful and activated. The ultimate value of big data is to gain human intelligence [9][11]][38]. Big data provides an unprecedented opportunity to observe the real world in a full view rather than partial samples [4] [17] [39] . Big data are treated as Basic resources. McKinsey [4] believes that data is the basic resource, and can be compared with physical assets, human capital, create significant value for the world economy, improve the productivity and competitiveness of the enterprises and the public sector, and create a large number of economic surplus for consumers. In 2012, Gartner believes that “Big data is big money”. The U.S. government considers big data as “new oil” related to the country's economic restructuring and industrial upgrading [2] [3][40].

IV.

Big data processing is to implement the transitions: from data to information, from information to knowledge and from knowledge to wisdom. Big data processing includes object superposition, target buffer, spatial data cleaning, spatial data analysis, spatial data mining, spatial feature extraction, image segmentation, and image classification. Big data expression technology is designed to represent the data in a clear and effective way that reveals meaningful information to the user, or provide the user with a new perspective of view. Big data expression technology includes triangulated irregular networks, digital elevation models, digital terrain models, flat maps, three-dimensional maps, image maps, and digital city maps.

SPATIAL DATA MINING

Spatial data mining refers to the basic technologies to realize the value of big data, relocate data assets, and use it effectively. Spatial data mining can be used to extract information from data, mine knowledge from information, extract data intelligence in knowledge, improve the ability of self-learning, self-feedback adaptation, finally realize human-machine intelligence[9][41].

Big data quality assessment technology is aimed to avoid the risk of big data collecting and high-density measuring. The technology includes logical assessment method, exception value based assessment method, and accounting based assessment method.

A. Basic big data technology The basic techniques of big data include data collection, storage, processing, expression, and quality evaluation [18].

B. Discovery Spatial knowledge Spatial knowledge discovery is the technology that uses spatial data mining method to extract previously unknown, potentially useful, and ultimately comprehensible rules. It is also a process of gradual sublimation from spatial data to space information, and to space knowledge, step-by-step. Spatial data mining systems aims to make spatial data gradually summarized into spatial knowledge. Through the integration of space data, it can

Big data can be generated in mobile devices, tracking systems, radio frequency identification devices (RFIDs), sensor networks, social networking, Internet search, automatic recording systems, video archives, e-commerce, as well as the process in analyzing those data. The acquisition mode of big data can be divided into three types: Point acquisition, plane acquisition, and mobile acquisition [11]. The point acquisition

489

individuals, so that it improves the interaction clarity, efficiency, flexibility and response speed. It changes from the traditional single dimension such as: production consumption, management, or planning execution, to a new multidimensional collaborative relationship. In this new relationship, both individuals and organizations can freely contribute and get information and expertise accurately and timely. This new relationship exerts a positive influence on each other to reach smart running macro-effects[40].

deeply extract space knowledge. By using such new knowledge, data can be processed in real time in order to understand and apply the data, to make intelligent judgments and well-informed decisions. Space knowledge can be selflearning, self-enhance, universal, and easily recognized. It could serve as a basis for decision support. If businesses take full advantage of spatial knowledge, it will be more precise and dynamic for humans to learn, work, life, and achieve wisdom state. It will help to improve resource utilization and productivity level. Moreover, it will also help to respond to the economic crisis, the energy crisis, the deterioration of the environment and many other global issues.

Spatial data prompted the fusion of world's digital infrastructure and physical infrastructure. Almost anyone or anything can be connected to a digital network at low-cost. It is easy to embed sensors in a variety of ecosystems. It can be more sophisticated, dynamic management of production and life to achieve the smart status, through global infrastructure of equipment and devices, via the integration of human society and the physical system by Internet, with supercomputers and cloud computing. For example, space knowledge can be found through integrating the data from satellite positioning systems, sensors, and wireless networks. , Spatial knowledge is then transferred to the mobile phone terminal, which helps users make reasonable judgment based on location services in order to achieve the intelligent data transformation into wisdom (Fig. 3). Firstly, the reality space world is thoroughly studied through the acquisition of spatial data from satellite positioning systems, sensors and wireless networks. Secondly, the spatial data is stored and managed by appropriate method, in order to integrate into spatial information. Thirdly, a variety of spatial knowledge is extracted through purposefully data mining methods and mining model. Spatial knowledge is integrated to the new available knowledge, in order to reach deeper data intelligence. Finally, data intelligence is integrated into the digital earth and the "Internet of things", in order to enhance the wisdom of users and machines, and to achieve smarter real world data interaction [41-43].

C. Extraction data intelligence Data intelligence is the ability to obtain a more innovative, systematic and comprehensive knowledge to solve a particular problem through an in-depth analysis of the collected data. It is an ability to understand and solve problems fast, flexibly and correctly. Spatial data intelligent has three features: more thoroughly perception, more extensive interoperability, and deeper intelligence. The three features are aimed to get bigger and more comprehensive data, to share and co-operate data via the Internet, to do data analysis and data mining by variety of advanced techniques, and to constitute a hierarchy of spatial data intelligences (Fig. 3) [35] [36]. More in-depth data intelligence is to create new value of data. On the one hand, when making full use of spatial data knowledge in all walks of life, it can produce secondary knowledge. In order to form a mining mechanism to mine knowledge in knowledge, it needs to bring primary knowledge together to form an intelligent form of expression. Ultimately, the destination knowledge can be achieved. On the other hand, based on a general industrial or socio-ecological system, it can redefine the interactive mode of government, companies and

Figure 3. Location-based data intelligences[6]

490

V.

[17] Reshef D N et al., 2011, Detecting novel associations in large data sets, Science, 334, 1518 [18] Surhone L M, Tennoe M T, Henssonow S F, Big Data: BigTable, Cloud Computing, Database Theory. Betascript Publishing, 2010 [19] Tu Z 㧘 Big Data: Data revolution is coming 㧘 Guangxi Normal University Press, 2012 [20] Zhu Z, Yu C, Yan L, Big data: Big value,Big chance,Big Change, Electronics Industry Press, 2012 [21] Barabasi A.-L.㧘Bursts: The Hidden Patterns Behind Everything We Do 㧘Plume Books㧘2011 [22] Meyer-Schoenberger V, Cukier K, Big Data: a Revolution That will Transform How We Live, Work and Think, London:John Murray, 2013 [23] Bian F L, Digital vision to see the world, Wuhan: Wuhan University Press, 2011 [24] Wang S L, Zeng Y X, Yuan H N, Service Science Introduction, Wuhan: Wuhan University Press㧘2009 [25] Wang S L, Shi W Z, 2012, chapter 5 Data Mining, Knowledge Discovery. In: Wolfgang Kresse and David Danko (eds.) Handbook of Geographic Information (Berlin: Springer), pp.123-142 [26] Li D R, Wang S L, 2007, Spatial data mining and knowledge discovery, Advances in Spatio-Temporal Analysis: in: ISPRS Book Series, Vol. 5, 171-192, Editor(s) - Xinming Tang, Yaolin Liu, Jixian Zhang, Wolfgang Kainz, London:Taylor and Francis [27] Burstein F, Holsapple C W, Handbook of Decision Support System, Berlin: Springer , 2008 [28] Cressie N, Statistics for Spatial Data (revised edition), New York: John Wiley and Sons Inc. 1993, [29] Haining R., Spatial Data Analysis: Theory and Practice, Cambridge: Cambridge University Press, 2003 [30] Koperski K, A Progressive Refinement Approach to Spatial Data Mining. Ph.D. Thesis. British Columbia: Simon Fraser University, 1999 [31] International Data Corporation, Electronic Medicines Compendium. 2011 IDC Digital Universe Study: Big Data is Here, Now What?. June 28, 2011 [32] Smets P, Imperfect information: imprecision and uncertainty. In: Uncertainty Management in Information Systems. London: Kluwer Academic Publishers. 225-254, 1996. [33] Smithson M J, Ignorance and Uncertainty: Emerging Paradigms. New York: Springer-Verlag, 1989 [34] Kim W. et al., A taxonomy of dirty data. Data Mining and Knowledge Discovery, 2003,7: 81–99 [35] Hernàndez M A , Stolfo S J, Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 1998, 2, 1-31 [36] Dasu T, Exploratory Data Mining and Data Cleaning. New York: John Wiley & Sons, 2003 [37] Paul M, Cassette tapes are the future of big data storage. New Scientist, 2887, p20, 2012 [38] Smith A, An Inquiry into the Nature and Causes of the Wealth of Nations. New York: The Modern Library, 1776. [39] Ostler T.A., et al. Ultrafast heating as a sufficient stimulus for magnetization reversal in a ferrimagnet. Nature Communications 3, 666, Feb. 07, 2012 [40] Wang S L, Spatial data mining under smart earth, Proceedings of 2011 IEEE International Conference on Granular Computing, 2011, pp.717722. [41] Craglia M, Bie K, Jackson D, Digital Earth 2020: towards the vision for the next decade , International Journal of Digital Earth, 2012, 5(1):4-21 [42] Wang S L, Gan W Y, Li D Y, Li D R, Data Field for Hierarchical Clustering. International Journal of Data Warehousing and Mining, 2011, 7(2) 43-63 [43] Morgan Stanley. Cloud Computing Takes off Market Set to Boom as Migration Accelerates. May 23, 2011.

CONCLUSION

The development of big data extends the scope of human activities. The world has been cooperating and integrating on a global scale. It redefines the relationship among individuals, businesses, organizations, governments, and societies through networked thinking and further to improve the human living environment, to enhance the quality of public services, to improve performance, efficiency and productivity through the interactive operation. The technological progress and industrial upgrading of big data will create new markets, business models and industry rules, and more importantly it demonstrates the collective will of a country that looking for strategic advantage. Although there is still a large gap to gain data intelligence like human wisdom, big data is a promising topic to understand the world from an entirely new aspect. ACKNOWLEDGEMENTS This project is supported by the NSFC (61173061, 71201120), and the Doctoral Fund of Higher Education (20121101110036). REFERENCES [1] [2]

[3]

[4] [5] [6] [7]

[8] [9] [10]

[11] [12]

[13]

[14]

[15] [16]

United Nations Global Pulse, 2012, Big Data for Development: Challenges & Opportunities, May 2012 Office of Science and Technology Policy | Executive Office of the President, 2012, Fact Sheet: Big Data across the Federal Government, March 29㧘2012㧘www.WhiteHouse.gov/OSTP Office of Science and Technology Policy | Executive Office of the President, 2012, Obama Administration Unveils ̌Big Data̍ Initiative: Announces $200 Million in New R&D Investments, March 29㧘2012㧘 www.WhiteHouse.gov/OSTP McKinsey Global Institute, 2011, Big Data: the Next Frontier for Innovation, Competition, and Productivity, May 2011 Rajaraman A, Ullman J D㧘Mining of Massive Datasets, Cambridge University Press, 2011 Grossner K, Goodchild M, Clarke K, Defining a digital earth system. Transactions in GIS, 12(1): 145-160, 2008 Densham P J, Goodchild M F, Spatial decision support systems: A research agenda, In: Proceedings GIS/LIS'89, Orlando, FL., 1989, pp. 707-716. Li D R, Wang S L, Li D Y, Theory and Application of Spatial Data Mining (the fisrt edition), Beijing㧦Science Press, 2006 Li D R, Wang S L, Li D Y, Theory and Application of Spatial Data Mining (the second edition), Beijing㧦Science Press, 2013 Ester M et al., Spatial data mining: databases primitives, algorithms and efficient DBMS support. Data Mining and Knowledge Discovery, 2000, 4, 193-216 Miller H J, Han J, Geographic Data Mining and Knowledge Discovery (the second edition), London: Taylor and Francis, 2009 Schoier G., Borruso G., 2012, Spatial data mining for highlighting hotspots in personal navigation routes, International Journal of Data Warehousing and Mining, 8(3):45-61 Bimonte S., Bertolotto M., Gensel J., Boussaid O., Spatial OLAP and Map Generalization: Model and Algebra. International Journal of Data Warehousing and Mining, 8(1): 24-51, 2012 Silva R., Moura-Pires J., Santos M. Y. 2012, Spatial Clustering in SOLAP Systems to Enhance Map Visualization. International Journal of Data Warehousing and Mining,8(2): 23-43 (2012) Lapkin A, Hype Cycle for Big Data, 31 July 2012, Gartner, Inc. | G00235042, 2012 Shekar S, Xiong H(Eds.), Encyclopedia of GIS. New York: Springer, 2007 Mills M P, Ottino J M, The Coming Tech-led Boom[N], 2012-10-12㧘 www.wsj.com

491