Open Access: Data Harvesting/Mining

Gaurav Gupta, Virendra Kumar and A K Shukla
Central Scientific Instruments Organisation (CSIR), Chandigarh
Email: {gauravgupta, virendrakumar, akshukla}@csio.res.in

Abstract

Data, the lowest level of the Knowledge Pyramid and the foundation for knowledge abstraction and inference, can be defined as a collection of information that represents variables having qualitative and quantitative properties. Data Mining refers to the extraction of hidden predictive information from a data warehouse or from multiple distributed databases scattered across geographic boundaries. Data Harvesting refers to the gathering of data from heterogeneous sources into a particular database from which various inferences can be drawn. Data Harvesting/Mining can be achieved by creating virtual agents that extract data from various sources, manage it in a data warehouse, and can even publish it to multiple destinations at the same time. Open Access refers to the freedom of a researcher or analyst to access data from multiple heterogeneous databases, without the restrictions of patents or copyright, in order to search for hidden patterns and deduce inferences that enrich a knowledgebase which can be redistributed. Open Access is the medium through which the power of Data Harvesting or Data Mining can be realised for the development of innovative tools such as web search engines and Open Source Drug Discovery (OSDD). OSDD can be considered the best example of Open Source Data Harvesting/Mining. As per the OSDD statement: “A concept to collaboratively aggregate the biological and genetic information available to scientists in order to use it to hasten the discovery of drugs. This will provide a unique opportunity for scientists, doctors, technocrats, students and others with diverse expertise to work for a common cause.” In this paper, our emphasis is to examine thoroughly the importance and benefits of an Open Source Data Harvesting/Mining system and to highlight the need for a collaborative, shared knowledge domain for the benefit of society and mankind.

Keywords

Open Access. Data Harvesting. Data Mining. OSDD

1. Introduction

Data, the lowest level of the Knowledge Pyramid and the foundation for knowledge abstraction and inference, can be defined as a collection of information that represents variables having qualitative and quantitative properties. Data Mining refers to the extraction of hidden predictive information from a data warehouse or from multiple distributed databases scattered across geographic boundaries. Data Harvesting refers to the gathering of data from heterogeneous sources into a particular database from which various inferences can be drawn. Open Access refers to the freedom of a researcher or technologist to access data from multiple heterogeneous databases, without the restrictions of patents or copyright, in order to search for hidden patterns and deduce inferences that enrich a knowledgebase which can be redistributed. Open Access is the medium through which the power of Data Harvesting or Data Mining can be realised. Our emphasis is to examine thoroughly the importance and benefits of an Open Source Data Harvesting/Mining system and to highlight the need for a collaborative, shared knowledge domain for the benefit of society and mankind.

2. Background and related work

Data Mining evolved from the days of local databases, when data was stored in flat files and the concern was retrospective, static data delivery. With improvements in hardware processing capabilities and other technological advances, the approach moved towards retrospective, dynamic data delivery at record level using relational databases (RDBMS), Structured Query Language (SQL) and highly scalable microprocessors. This led to the development of large data banks capable of storing huge amounts of data, and the concept moved towards retrospective, dynamic data delivery at multiple levels using On-line Analytic Processing (OLAP). It finally resulted in the concept of prospective, proactive information delivery, which is the core of Data Mining. The major contributions towards achieving data mining came from scalable parallel-processing microprocessors and from advances in data access algorithms. Data Harvesting can be considered a synonym of Data Mining; however, it is more concerned with the collection of heterogeneous data from multiple locations into a particular database.

Open Access to data for Data Harvesting or Mining is a paradigm shift from restricted access towards free access, regardless of any restriction placed on the data by patent or copyright. With the expansion of information gathering and dissemination, Open Access has grown manyfold, from a past in which data was confined to a limited set of researchers to one in which it is available to an open society. It is therefore necessary to make the concept of Open Access to data, databases and warehouses more meaningful, so that the daunting restrictions of copyright and patent do not squeeze the space for collaborative work and innovative thought. Our primary emphasis is to highlight the importance and benefits of Open Access in relation to Data Harvesting/Mining.
Open Source Drug Discovery (OSDD) can be considered the best example of Open Source Data Harvesting/Mining. The aim of OSDD is to aggregate the knowledgebase of biological and genetic information available to scientists and researchers for the discovery of drugs, by bringing together the work of scientists, researchers, doctors and other such contributors.

3. Need of Open Access Data Harvesting/Mining

The need for Open Access derives from the fact that knowledge sharing and dissemination are the best way to achieve rapid growth and development of technology, with wider reach and without considerable expenditure. Open Access gives full access to a repository for the purposes of aggregation, analysis and redistribution. The benefits of data mining can only be achieved through freedom of data access, such as that offered by the Internet, which has resulted in the development of data mining tools such as the Google web search engine, Wikipedia and OSDD. The best example in support of this is OSDD, which is primarily focused on providing affordable healthcare for the common people. It has provided open access to the complete sequencing of the M. tuberculosis genome to the scientific community. It also provides a platform for many researchers, analysts, scientists and doctors to share their knowledge and views in the area of biotechnology in order to solve problems related to tropical diseases such as malaria, tuberculosis and leishmaniasis. The restrictions of copyright and patents constrict the scope of free knowledge generation and dissemination. Open Access, by contrast, allows data to be freely accessed and analysed: harvesting and mining the data generates knowledge patterns that others can analyse further to enhance the knowledge.

4. Working Concept of Open Access - Data Harvesting/Mining

The principle behind Open Access in respect of Data Harvesting/Mining is free access to data, which can be downloaded, aggregated, analysed or improved without any restriction, so that contributors can collaborate and provide innovative thoughts. Data from heterogeneous, geographically separated databases is collected into a central database, i.e. a data warehouse. The researcher, scientist or analyst then mines it for hidden patterns using the data mining tools available for a particular type of application, such as OSDD, built with Open Source Web Technologies such as Linux OS, XML, GML etc. (See Fig. 4.1)

Fig. 4.1
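
The flow described above (harvest from heterogeneous sources into a central store, then mine that store for patterns) can be illustrated with a minimal Python sketch. It is not the OSDD implementation or any tool named in this paper. The three in-memory sources, the compound/target field names, the `observations` table and the "shared targets" pattern are all hypothetical and chosen only for illustration; a real harvesting agent would fetch its inputs from remote repositories.

```python
import csv
import io
import json
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical sample sources in three different formats (assumed for this sketch).
CSV_SOURCE = "compound,target\nC1,T1\nC2,T1\n"
JSON_SOURCE = '[{"compound": "C1", "target": "T2"}, {"compound": "C3", "target": "T1"}]'
XML_SOURCE = "<records><record compound='C2' target='T2'/></records>"


def harvest_csv(text):
    """Parse CSV text into (compound, target) tuples."""
    return [(row["compound"], row["target"]) for row in csv.DictReader(io.StringIO(text))]


def harvest_json(text):
    """Parse a JSON array of objects into (compound, target) tuples."""
    return [(row["compound"], row["target"]) for row in json.loads(text)]


def harvest_xml(text):
    """Parse <record> elements into (compound, target) tuples."""
    root = ET.fromstring(text)
    return [(el.get("compound"), el.get("target")) for el in root.iter("record")]


def build_warehouse(conn):
    """Harvesting step: load every source into one central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations (compound TEXT, target TEXT, source TEXT)"
    )
    sources = (
        ("csv", harvest_csv(CSV_SOURCE)),
        ("json", harvest_json(JSON_SOURCE)),
        ("xml", harvest_xml(XML_SOURCE)),
    )
    for name, rows in sources:
        conn.executemany(
            "INSERT INTO observations VALUES (?, ?, ?)",
            [(compound, target, name) for compound, target in rows],
        )
    conn.commit()


def mine_shared_targets(conn):
    """Mining step: count how many targets each pair of compounds has in common."""
    compounds_by_target = {}
    query = "SELECT DISTINCT compound, target FROM observations"
    for compound, target in conn.execute(query):
        compounds_by_target.setdefault(target, set()).add(compound)
    pair_counts = Counter()
    for compounds in compounds_by_target.values():
        for pair in combinations(sorted(compounds), 2):
            pair_counts[pair] += 1
    return pair_counts


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    build_warehouse(warehouse)
    for pair, shared in mine_shared_targets(warehouse).most_common():
        print(pair, shared)
```

With the sample data, no single source shows that C1 and C2 share two targets; the relationship only becomes visible once all three sources are combined in the central table, which is the point of harvesting before mining.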

The Open Access concept related to data harvesting/mining can be divided into four major heads:

1. Data Harvesting – The Internet is a vast source of heterogeneous information in which data is scattered in an unorganised manner, without any set pattern or schema. An agent is responsible for collecting data from different locations and storing it in a central store or data warehouse, which is the source from which hidden patterns are extracted.
2. Data Mining – Numerous hidden patterns are stored in the central store or data warehouse and need to be extracted. This is accomplished through the use of the Data Mining tools available for a particular application, such as OSDD, which is an Open Source model for Drug Discovery.
3. Knowledgebase Generation – The information generated from hidden data patterns is conceptualised, generalised and aggregated to form a knowledgebase, which acts as a base for deducing inferences and discovering novel methods for problem diagnostics.

5. Importance of Open Access: Data Harvesting/Mining

Open Access: Data Harvesting/Mining has numerous benefits, a few of which are:

1. Citation Index – Freely accessible publications have a better citation index and impact factor, because the full text of the journal is available for download and further analysis. For example, BioMed Central has published 78503 articles of peer-reviewed biomedical research, and the German and English versions of Deutsches Ärzteblatt’s medical scientific articles are available free of cost.
2. Affordable Technology – Since the data is freely accessible for download, analysis and further conceptualisation, researchers, scientists and technologists can work collaboratively, sharing their knowledge and data for a common cause to produce novel, affordable technology. An example is OSDD, an open platform for pooling the biological and genetic information available to scientists for the discovery of drugs.
3. Interdisciplinary Approach – Another benefit of Open Access is collaborative work across a scientific community of diverse disciplines. An example is OSDD, where scientists, researchers, doctors, technologists and IT professionals work together on a common platform for a common cause related to affordable healthcare.

6. Conclusion and future trend

The study in this direction concludes that Open Access, in reference to Data Harvesting/Mining, has a far greater impact than any restricted or non-Open Access approach. The findings suggest that no research organization can afford full access to all of the restricted journals or data held by publishers or information providers. Moreover, especially in developing countries, where limited resources can already hamper research, a non-Open Access approach may also result in higher expenditure on research and, subsequently, more expensive technology. Therefore, the effort should be towards an Open Access approach in which freely accessible data or information can be utilised, downloaded, analysed and presented freely for distribution or publication. The future is already moving towards the Open Access approach, as more and more publishers and information providers offer full access to their contents. OSDD is already moving in this direction by following an Open Access approach towards healthcare. A great step in this direction would be for research institutions to consider, as a part of their mandate, providing their institutional repositories on an Open Access basis.

References

1. Gunther Eysenbach. The Open Access Advantage. Journal of Medical Internet Research; Vol 8, No 2 (2006)
2. Stephan Mertens. Open Access: Unlimited Web Based Literature Searching. Dtsch Arztebl Int. v.106(43); Oct 2009
3. Matthew Cockerill. Data mining Open Access research. Biomed Central; September 8, 2003
4. Harnad, Stevan. Current Research Information Systems. Maximizing Research Impact through Institutional and National Open-Access Self-Archiving Mandates. Cogprints. 2006