Knowledge Extraction and Integration - PlanetData

Knowledge Extraction and Integration using Automatic and Visual Methods
Vedran Sabol, Roman Kern, Barbara Kump, Viktoria Pammer, Michael Granitzer
vsabol|rkern|bkump|vpammer|[email protected]
Know-Center, Inffeldgasse 21a, 8010 Graz, Austria

Introduction
Techniques for efficiently accessing, analyzing and presenting large amounts of dynamic, heterogeneous data, with the goal of acquiring new knowledge and facts, play an increasingly important role in application domains such as media, patent databases, scientific publication repositories and medical databases. We identify two areas of research which, we believe, can significantly improve fact discovery and knowledge acquisition from large amounts of data, particularly when applied together:
• There is large potential in combining the knowledge present in structured, semantic data, such as Linked Open Data (DBpedia, GeoData, Friend of a Friend, etc.), with unstructured information, such as human-readable content. Integration of structured and unstructured data will facilitate the development of advanced data mining and knowledge extraction techniques.
• Visual interfaces are a powerful means for exploring and analyzing large amounts of data and for providing access to knowledge and facts. Additionally, visual interfaces can serve as a common discourse platform that supports collaboration and networking, and empowers users to share their insights with co-workers and contribute knowledge to the Linked Data Cloud.

Vision
We envision an approach combining automatic methods and visual techniques for discovering facts and deriving new knowledge from massive data. Besides data and content, which contain implicit knowledge, two explicit knowledge sources shall be utilized in the process: semantic databases and human expertise. Patterns and facts, which are either discovered by algorithms or unveiled through visual analysis, shall be validated by humans and integrated into Linked Data repositories through dedicated visual interfaces (Figure 1). The advantage of integrating the newly acquired facts and knowledge is twofold:
• They can be used by extraction and mining algorithms to improve their performance.
• They are made available to other people, either direct collaborators or people independently working on related topics.
It should be noted that the proposed scenario borrows from the ideas of the social Web. However, instead of publishing new user-generated content on various Web platforms, the focus here is on deriving new facts and knowledge and sharing them through integration with existing knowledge bases (i.e. Linked Data).

[Figure 1 diagram. Labelled components: LOD Cloud; Data/Content Repositories; Extraction of Semantics, Data Mining; Visual Interfaces; Knowledge, Extracted Facts, Patterns; Pattern recognition, Hypothesis generation & validation, Deriving new knowledge; Validated Facts and Knowledge.]

Figure 1: Extraction of new knowledge and facts and their integration into the Linked Data Cloud using a combination of automatic and visual methods.
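The validate-then-integrate loop of Figure 1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the CandidateFact class, the validate and to_ntriples helpers, the source labels and the http://example.org/ IRIs are hypothetical and do not describe an existing system. The sketch only shows the idea that a fact enters the shared knowledge base after a human has accepted it, here by emitting N-Triples statements for validated facts only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CandidateFact:
    """A subject-predicate-object statement awaiting human validation."""
    subject: str
    predicate: str
    obj: str
    source: str             # "algorithm" or "visual-analysis" (hypothetical labels)
    validated: bool = False


def validate(fact):
    """Return a copy of the fact marked as accepted by a human reviewer."""
    return replace(fact, validated=True)


def to_ntriples(facts):
    """Serialize only the validated facts, one N-Triples statement per fact."""
    return [f"<{f.subject}> <{f.predicate}> <{f.obj}> ." for f in facts if f.validated]


EX = "http://example.org/"  # placeholder namespace, not a real vocabulary
extracted = CandidateFact(EX + "Graz", EX + "locatedIn", EX + "Austria", "algorithm")
rejected = CandidateFact(EX + "Graz", EX + "locatedIn", EX + "Australia", "algorithm")

# Only the fact a reviewer accepted is passed on for integration.
triples = to_ntriples([validate(extracted), rejected])
print("\n".join(triples))
```

Keeping the validation flag on the fact itself, rather than in a separate list, makes it hard to publish a statement that nobody has reviewed.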

Semantic Enrichment and Data Mining
Extraction of semantics from unstructured data such as text is a well-established area of research; however, extraction and disambiguation of facts still pose significant challenges. Current information extraction systems [Etzioni et al. 2010] are capable of extracting simple, frequent factual relationships from open data sets like the Web, but extraction and disambiguation of infrequent entities for specific domains, and integration of the extracted semantics into Linked Data repositories, remain a major challenge. Integration of different, ontologically encoded knowledge can be achieved through ontology alignment or ontology merging algorithms [Euzenat & Shvaiko 2007]; however, approaches involving human intervention through visual interfaces may help alleviate the limitations of these algorithms [Granitzer et al. 2010]. Using semantic information to advance data mining methods appears to be a promising research topic: several authors have shown that semantic information can be used successfully in mining tasks, for example in text clustering [Hotho et al. 2002], [Szczuka et al. 2011].

Our goal is to develop methods for extracting semantic information and facts, and for enriching human-readable content such as governmental information, scientific publications, patent databases, media articles and user-generated content. Newly extracted knowledge and facts shall be integrated with the Linked Open Data Cloud. The resulting methods will be based on natural language processing (NLP) and machine learning techniques, such as scalable text clustering [Muhr et al. 2010a] and cluster labelling [Muhr et al. 2010b]. In this setup, several challenges deserve particular attention: entity disambiguation techniques [Kern et al. 2010], [Kern et al. 2011b], information diffusion and reuse [Kern et al. 2011a], and information quality and trust [Lex et al. 2010].
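To illustrate why semantic information can help in text clustering, the following Python sketch adds concept features to a bag-of-words vector before computing document similarity. It is a toy example in the spirit of ontology-based clustering, not the method of any cited work: the CONCEPTS mapping is a hypothetical stand-in for an ontology or Linked Data lookup, and the enrich and cosine functions are names chosen here for illustration.

```python
import math
from collections import Counter

# Hypothetical term-to-concept mapping standing in for an ontology lookup.
CONCEPTS = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "headache": "symptom",
    "migraine": "symptom",
}


def enrich(tokens, concepts=CONCEPTS):
    """Build a term-count vector and add one count per mapped concept."""
    vec = Counter(tokens)
    for token in tokens:
        if token in concepts:
            vec["concept:" + concepts[token]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc1 = ["aspirin", "relieves", "headache"]
doc2 = ["ibuprofen", "treats", "migraine"]

plain = cosine(Counter(doc1), Counter(doc2))    # no shared words
enriched = cosine(enrich(doc1), enrich(doc2))   # shared concepts: drug, symptom
print(plain, enriched)
```

The two documents share no words, so their plain similarity is zero; after enrichment they overlap on the two concept features and a clustering algorithm would be able to place them close to each other.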

Visual Access and Analysis
Visual Analytics [Thomas & Cook 2005, Keim et al. 2010] is a research field focusing on supporting humans in analytical reasoning over massive data sets using visual interfaces. It strives to effectively integrate human knowledge and experience into complex analytical processes by suitably combining machine processing with visual analysis methods.

Our goal is to develop visual analysis techniques for large unstructured repositories and for Linked Data repositories. Visual interfaces built atop automatic techniques shall provide intuitive access to data and knowledge, support pattern recognition, and allow discovered, validated facts to be integrated back into the Linked Data Cloud. Ontological information can be used to improve visual interfaces in a variety of ways [Paulheim & Probst 2010]. Ontologically described user interface components will be easily adaptable to various data types and sources, allowing users to explore, analyse and manipulate information delivered by semantic enrichment, mining and integration methods. Visual components that assist and simplify data binding from Linked Open Data repositories to user interface elements shall also be considered.

A visualization system providing methods for exploring and analysing massive repositories, and for integrating the discovered knowledge into semantic knowledge repositories, will be composed of a variety of visualization components which can be grouped into two categories [Granitzer et al. 2011]: discovery components, where information flows from the repository to the users, and description components, which are suitable for expressing knowledge and integrating it with Linked Data.

Discovery components serve analytical purposes such as explorative navigation and discovery of new insights. Examples are charting components (for example bar, pie, line and spider charts) and advanced analytical components targeting abstract information such as graphs and networks, topical relatedness, and change and temporal information [Sabol et al. 2009]. Selection of suitable visual metaphors for a particular task and data set shall be performed (semi-)automatically based on available best practices [Lengler & Eppler 2007].

Description components are customizable knowledge visualization metaphors [Eppler & Burkhard 2004], [Bertschi et al. 2011] that empower users to intuitively express and communicate facts and knowledge. A new visualization shall initially be created as an empty "skeleton" of a chosen metaphor, to which the user applies a "Builder Tool" to construct a visual representation expressing (newly acquired) knowledge. In doing so, the expressed facts shall be integrated into the Linked Data Cloud. If made publicly available, such a visual representation would also provide a platform for transferring and communicating knowledge to a broader audience, which can extend it by integrating additional knowledge.
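The (semi-)automatic selection of a visual metaphor can be pictured as a small rule table that proposes candidates and leaves the final choice to the user. The Python sketch below is purely illustrative: the suggest_charts function, the data_kind labels and the rules themselves are assumptions made here and are not taken from the cited best practices.

```python
def suggest_charts(data_kind, n_series=1):
    """Return chart types to offer the user, best guess first.

    data_kind: "temporal", "network", "categorical" or "multivariate"
    (hypothetical labels); n_series: number of value series to compare.
    """
    if data_kind == "network":
        return ["graph"]
    if data_kind == "temporal":
        return ["line chart", "bar chart"]
    if data_kind == "multivariate":
        return ["spider chart", "bar chart"]
    if data_kind == "categorical":
        # Illustrative rule: offer a pie chart only for a single series of shares.
        return ["pie chart", "bar chart"] if n_series == 1 else ["bar chart"]
    return []  # unknown kind: leave the choice entirely to the user


print(suggest_charts("temporal"))
print(suggest_charts("categorical", n_series=3))
```

Returning a ranked list instead of a single answer keeps the selection semi-automatic: the system narrows the options and the user confirms or overrides the first suggestion.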

Challenges
We compile an (incomplete) list of challenges, grouped into five categories, which need to be addressed to realize the vision described above:
1. Data and information:
   a) Information quality: With the advent of the Social Web and user-generated content, information quality and trustworthiness gain a prominent role. The same is true for user-generated knowledge.
   b) Information reuse and diffusion: In large-scale collaboration scenarios, tracking the diffusion of information and knowledge and identifying their reuse become important.
   c) Security and ownership: Access control and security mechanisms are hard to address in large, distributed data and knowledge bases.
   d) Change and evolution: Identification of trends and temporal patterns, and handling of high rates of change in data, content and knowledge, are necessary when dealing with dynamically changing repositories.
   e) Data integration: Binding to different data and knowledge repositories, and transforming data into the required form, become indispensable when dealing with heterogeneous information.
2. Storage infrastructures: Efficiently applying distributed Big Data storage infrastructures, such as Hadoop/HDFS, and integrating them with traditional infrastructure (relational databases) and with emerging models, such as cloud computing.
3. Algorithms: To cope with huge amounts of data, scalable algorithms need to be developed. Two approaches appear promising in this context:
   a) Distributed algorithms, for example based on Hadoop (MapReduce on top of HDFS, the open-source counterparts of Google's MapReduce and Google File System), process data on a large number of computing nodes which may be placed in geographically separate locations.
   b) Streaming algorithms process a huge stream of information in one or a few passes and with limited resources (memory), by building an approximate summary or aggregation of the data.
4. User interfaces and visualization:
   a) Visual analytics: Advanced, scalable visual interfaces are necessary for analysis and exploration of large data and knowledge bases. Semantic descriptors hold the promise of automatic binding to semantically described data.
   b) Mobility: The mobile boom redefines the requirements on GUI design and interactivity.
   c) Context: Besides the classical user context (e.g. task, preferences), user context can be extended with sensory data, such as those provided by mobile devices.
   d) Collaboration: New collaboration possibilities arise through increased mobility and permanent broadband network access.
5. Use and commercialization: For various types of content (such as videos, music and news), established commercial and non-commercial utilization models exist. For Linked Data this is not (yet) the case. Sustainable ecosystems for LOD utilization need to be built and established, possibly drawing on the lessons of the social Web and user-generated content.
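As a concrete instance of the streaming approach named in challenge 3b, the following Python sketch implements the Misra-Gries frequent-items summary. It is chosen here only as a well-known example of a one-pass, bounded-memory algorithm; the text above does not prescribe a specific technique, and the function name and sample stream are illustrative.

```python
def misra_gries(stream, k):
    """One-pass summary of frequent items using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys. The stored counts are lower
    bounds on the true frequencies, not exact values.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

Memory use depends only on k, not on the length of the stream, which is the trade-off described above: an approximate summary in exchange for a single pass with limited resources. A second pass over the data can turn the candidates into exact counts if needed.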

Each of the challenges presents a research topic on its own. Our interests have been focusing on topics such as information reuse, quality and evolution, consideration of user context, and visual analytical interfaces (including mobile applications). To bring forward the presented vision and address the challenges in a satisfactory manner, bundling of resources and competencies appears as a natural way to go.

References
[Bertschi et al. 2011] Bertschi, S., Bresciani, S., Crawford, T., Goebel, R., Kienreich, W., Lindner, M., Sabol, V., Vande Moere, A. (2011): What is Knowledge Visualization? Opinions on Current and Future State, in Proceedings of the 15th International Conference Information Visualisation (IV'11).
[Eppler & Burkhard 2004] Eppler, M.J., Burkhard, R.A. (2004): Knowledge Visualization – Towards a New Discipline and its Fields of Application, ICA Working Paper #2/2004, University of Lugano, Switzerland.
[Etzioni et al. 2010] Etzioni, O., Banko, M., Soderland, S., Weld, D.S. (2008): Open information extraction from the web, Communications of the ACM, 51(12), 68. doi:10.1145/1409360.1409378
[Euzenat & Shvaiko 2007] Euzenat, J., Shvaiko, P. (2007): Ontology matching, Springer-Verlag.
[Granitzer et al. 2010] Granitzer, M., Sabol, V., Onn, K.W., Lukose, D., Tochtermann, K. (2010): Ontology Alignment – A Survey with Focus on Visually Supported Semi-Automatic Techniques, Future Internet, Volume 2, Issue 3, 238-258, MDPI AG.
[Granitzer et al. 2011] Granitzer, M., Sabol, V., Kienreich, W., Lukose, D., Onn, K.W. (2011): Visual Analyses on Linked Data – An Opportunity for both Fields, The 2011 STI Semantic Summit, Riga, Latvia.
[Hotho et al. 2002] Hotho, A., Maedche, A., Staab, S. (2002): Ontology-based text document clustering, Künstliche Intelligenz, 16(4), pages 48-54.
[Keim et al. 2010] Keim, D., Mansmann, F., Thomas, J. (2010): Visual analytics: how much visualization and how much analytics?, ACM SIGKDD Explorations Newsletter, 11(2), 5–8, ACM.
[Kern et al. 2010] Kern, R., Muhr, M., Granitzer, M. (2010): KCDC: Word Sense Induction by Using Grammatical Dependencies and Sentence Phrase Structure, in Proceedings of SemEval-2, pages 351-354.
[Kern et al. 2011a] Kern, R., Seifert, C., Zechner, M., Granitzer, M. (2011): Vote/Veto Meta-Classifier for Authorship Identification, 3rd International Competition on Plagiarism Detection.
[Kern et al. 2011b] Kern, R., Zechner, M., Granitzer, M. (2011): Model Selection Strategies for Author Disambiguation, 8th International Workshop on Text-based Information Retrieval, in Proceedings of the 22nd International Conference on Database and Expert Systems Applications (DEXA 11), pages 155-160, IEEE Computer Society.
[Lengler & Eppler 2007] Lengler, R., Eppler, M. (2007): Towards A Periodic Table of Visualization Methods for Management, IASTED Proceedings of the Conference on Graphics and Visualization in Engineering (GVE 2007), Clearwater, Florida, USA.
[Lex et al. 2010] Lex, E., Khan, I., Bischof, H., Granitzer, M. (2010): Assessing the Quality of Web Content, Proceedings of the ECML/PKDD Discovery Challenge.
[Muhr et al. 2010a] Muhr, M., Sabol, V., Granitzer, M. (2010): Scalable Recursive Top-Down Hierarchical Clustering Approach with implicit Model Selection for Textual Data Sets, 7th International Workshop on Text-based Information Retrieval, in Proceedings of the 21st International Conference on Database and Expert Systems Applications (DEXA 10), IEEE.
[Muhr et al. 2010b] Muhr, M., Kern, R., Granitzer, M. (2010): Analysis of Structural Relationships for Hierarchical Cluster Labeling, in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 175-185, ACM.
[Paulheim & Probst 2010] Paulheim, H., Probst, F. (2010): Ontology-Enhanced User Interfaces: A Survey, International Journal on Semantic Web and Information Systems (IJSWIS), 6(2).
[Sabol et al. 2009] Sabol, V., Kienreich, W., Muhr, M., Klieber, W., Granitzer, M. (2009): Visual Knowledge Discovery in Dynamic Enterprise Text Repositories, Proceedings of the 13th International Conference on Information Visualisation (IV09), IEEE.
[Szczuka et al. 2011] Szczuka, M., Janusz, A., Herba, K. (2011): Clustering of rough set related documents with use of knowledge from DBpedia, Proceedings of the 6th International Conference on Rough Sets and Knowledge Technology (RSKT'11), pages 394-403.
[Thomas & Cook 2005] Thomas, J.J., Cook, K.A. (2005): Illuminating the Path: The Research and Development Agenda for Visual Analytics (p. 186), IEEE Computer Society. Retrieved from http://nvac.pnl.gov/agenda.stm.
