Management, Access and Exploration of Large and Heterogeneous Data in the Web and Relational Databases. 1
TIN2006-15536-C02
Ricardo Baeza-Yates *
Information Management Group, Fundació Barcelona Media
Josep Lluis Larriba-Pey *
DAMA-UPC, Universitat Politècnica de Catalunya
Abstract
This coordinated project studies the following problems: (i) data quality, (ii) information exploration and (iii) data mining of large data sets like the Web or in relational environments. The main goal is to use the technology of both worlds (Web and relational) to: (i) add quality to Web data, (ii) allow complex queries to process structured data, text and temporal sequences, and (iii) present the answers in a simple and concise form using data visualization techniques. The first sub-goal is the design and development of an extensible and modular framework to collect, process and store Web data. This framework will be used to study techniques for (a) improving data quality, (b) object storage, (c) a formal algebra that includes operations for object manipulation, retrieval, similarity, ranking, and statistics, allowing different types (text, graphs, logs, etc.), (e) optimization algorithms for complex queries, and (f) visualization of objects and their relations. The second sub-goal is the design and development of an exploration tool for relational databases, such that cross-referenced information can be represented and visualized, enriching it with ranking by using Web-based techniques. This module will have sub-modules for the fusion of multiple files of structured data and for the processing of temporal sequences coming from the Web, as well as visualization capabilities.
Keywords: data organization, data mining, data quality.
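1 Spanish title: Gestión, acceso y exploración de grandes volúmenes de datos heterogéneos en la Web y en entornos relacionales (Management, access and exploration of large volumes of heterogeneous data on the Web and in relational environments).
* Email: [email protected]
* Email: [email protected]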
1 Objectives of the project
The starting hypothesis for this project is that the information technologies used in database management systems and on the Web are complementary and can enrich each other. We propose the implementation of environments for research into the most innovative aspects that relate them. On one side, Web technologies have matured enough to be extended to aspects such as the formulation of new algebras that allow queries to be posed over different data models in a global and simple way. This leads us to consider hybrid technologies that optimize for different data types, research in access methods, and the need to take data quality and its improvement into account to obtain better answers. While optimization and data quality are mature areas in the relational model, studying those aspects for Web data may be important because heterogeneity and dispersion play an important role. On the other side, the relational model lacks the capability to explore the relationships between specific entities. Thus, aspects such as ranking in the relational and Web environments and new metaphors for the representation of results may be topics of interest for enriching the answers to queries.
2 Achievements
The project has so far evolved in different directions, achieving part of the objectives outlined above. On one side, the two groups have been developing infrastructure as the basis for their research. On the other side, the two groups have continued doing research, both by starting research topics together and in their respective areas of expertise.
2.1 Scientific results
The work of the teams during the first two years of the project has been mainly oriented to preparing the infrastructure that allows collaboration and to starting some collaborative topics. In addition, some collaborative results have already been obtained: the groups have published two papers together in the area of distributed cache management for Question Answering [DoCIKM08, DoEur07], which combines expertise in distributed computing with the need for infrastructure to manage Web-based Question Answering applications. Another means of collaboration has been the involvement in a joint European project, SEMEDIA (Search Environments for Media), where the groups participate bringing their knowledge in Web and distributed processing together with other partners like BBC, University of Glasgow, and Joanneum Research in Austria. Also, within the framework of the project, the two groups have continued research in their own areas of interest that are included in the reported project. On one side, the FBM group has been working on the following topics:
a. Web data characterization. To understand Web data better we have continued to do characterization studies on specific countries [ToCyb07, GrLAWeb08], across countries [BYTOIT07] and about duplicated content [BYJWE07].
b. Web data mining. We have continued doing novel Web data mining of different kinds, in particular on search data represented by queries and clicks, where we have been pioneers [BYSPIRE06, PeLAWeb06, BYJASIST07, BYWebKDD07, NeWebKDD07, PoWWW08, NeIJIS08, MeLAWeb08], and also on link structure [BYISE07]. Based on this experience, we have designed a framework for Web mining prototyping [PeWSDM09], which enabled us to shorten the development time from months to weeks. This has been exploited in recent results [BYWWW08].
c. Web data visualization. We have done research on the visualization of general data based on structure, content and usage, and applied those ideas to Web site usage visualization [PaIV07]. We have also explored the use of time in the visualization of search results [AlSIGCHI07, AlSIGIR07], in the context of temporal information retrieval [AlForum07]. In addition, we have done studies on the use of different search paradigms and their interfaces [GoLAWeb08].
d. Data anonymization. We have started exploring the issues of data privacy when query logs of a given Web site are published. This novel research received the best paper award in the first workshop focusing on Web privacy [PoPinKDD07].
On the other side, the DAMA-UPC group has been working on the following topics fulfilling the research objectives specified within the project:
e. Relational databases. In this respect, we have continued our work in relational processing with papers related to join processing [AgDKE07, AgCIKM08] and to the optimization of queries for the relational model [MuDAS07]. This topic has given rise to a PhD thesis defended in 2007 by V. Muntés-Mulero, a member of DAMA-UPC [MuTHESIS07].
f. Stream processing. In this area, we have been working on the practical use of Bloom filters for the efficient counting of events [AgDKE08] and the use of bitmaps for the efficient discrimination of events [PaDEXA08].
g. Graph data management. We have been working on technology for graph management (see next section), and the results of our research have appeared in a paper [MaCIKM07], a patent application [MaPAT08], a freely accessible Java library for graph management [DEX] and an MSc thesis defended by N. Martínez-Bazán [MaMSCT08]. We have also worked on demonstrating the capabilities of our technology, with results shown in [GoEDBT08] for BIBEX, a bibliographic exploration tool that can be accessed at [BIBEX]. This is the evolution of the main topic that relates relational data processing with the Web. In fact, we consider graph databases as a different means to understand the relations represented by tables, which links our work with that of the other group in terms of graph technology.
h. Data cleansing and integration. Related to relational query processing and graph data management, we have been working on the parallelization of record linkage methods [GuPriv08], and the cleansing of graph data [NiIDEAS07].
i. Data anonymization. This topic is new in our group and we have started acquiring some experience in it, with the aim of evolving towards the anonymization of large data sets, which is an open problem today. In this area, we have proposed different solutions for improving the quality obtained by anonymization processes [PoMDAI08, PoSOF08, MaPAT08].
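As an illustration of the event-counting idea mentioned under topic f, the following is a minimal sketch of a generic counting Bloom filter. It is not the dynamic adaptive structure of [AgDKE08]; the class name, the parameters `num_counters` and `num_hashes`, and the use of salted SHA-256 digests as hash functions are all assumptions made for this example.

```python
import hashlib


class CountingBloomFilter:
    """Approximate per-event counts using a fixed array of shared counters.

    Hypothetical illustration only: names and parameters are assumptions,
    not taken from the project's software.
    """

    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, event):
        # Derive one counter position per hash by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{event}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, event):
        # Record one occurrence by incrementing every counter the event maps to.
        for pos in self._positions(event):
            self.counters[pos] += 1

    def estimate(self, event):
        # Collisions can only inflate counters, so the minimum is an upper-biased
        # estimate that is never below the true number of occurrences.
        return min(self.counters[pos] for pos in self._positions(event))
```

Calling `add("login")` three times and then `estimate("login")` returns a value of at least 3; the memory used stays fixed at `num_counters` integers regardless of how many distinct events appear in the stream, which is the trade-off that makes this family of structures attractive for stream monitoring.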
2.2 Technology results
The two groups are oriented to engineering research that yields software ready for technology transfer at some point in the research process. In this sense, the two groups are generating technology both to enable research and as a product of the research done. The main technologies developed by the FBM group are all focused on Web mining:
WET. WET is a Web exploration tool for Web site usage structure and data [WET]. The technology behind the first prototype of this software was presented in [PaIV07].
WIM. WIM is a tool for mining Web information by using a high-level programming language. For example, many applications need only a dozen lines of code, which makes fast prototyping possible. This tool is explained in detail in [PeTHESIS08] and the main ideas will be presented in [PeWSDM09].
PACHA. PACHA is a prototype of a temporal search engine that allows users to explore Web content using temporal information present in the text, including interesting visualizations [PACHA]. Demos have been presented in important conferences [AlSIGCHI07, AlSIGIR07].
The technology developed by the DAMA-UPC fulfills one important objective of the project, which is to obtain technology stemming from research that can be used for the benefit of society. The technology generated by the DAMA-UPC group is:
DEX. DEX is a high-performance graph query engine that supports the execution of queries over large out-of-core graphs [DEX]. The technology of this software is explained in a conference paper [MaCIKM07].
DAURUM. This is a de-duplication engine that the group implemented prior to this project [DAURUM]. However, the research carried out during this project on the parallelization of the techniques used in DAURUM allows us to improve the software. This improvement stems from [GuPriv08].
CGO. The Carquinyoli Genetic Optimizer is a tool for the optimization of relational queries. CGO will be made available on the DAMA-UPC website. This technology stems from [MuDAS07] and earlier related papers.
DiQuA. The Distributed Question Answering tool is a query engine based on free software that can process English text to answer specific open queries. The software will be opened to the public upon completion of the PhD thesis linked with it. This technology stems from [DoCIKM08, DoEur07].
2.3 Technology Transfer
As a byproduct of the research and technology created, the groups are related to different industry partners with whom they collaborate intensely. The types of projects engaged with industry vary significantly depending on the group. The most important projects by the group of FBM are:
2006-2009. Visualization of processes and/or objects that have structure, content and usage in an integrated fashion, in particular information related to Web sites. This project has had the support of the company SPOC (ALTRAN group) through funding from Catalonia's CIDEM and the City of Zaragoza. PI: Juan Carlos Dursteler. 50K euros.
2007-2008. Mining Web Queries. PhD Support by Yahoo! Research. PI: Ricardo Baeza-Yates. 48K euros.
The most important projects of group DAMA-UPC are related to the technologies that we are improving thanks to our research. In this respect, we have collaborated and initiated different projects that benefit our society in one way or another. In particular the projects engaged by the group are:
2007-2009. Development of an analysis system of Spanish research. TIN2007-30380-E. PI: Josep Lluis Larriba Pey. 86K euros.
2009-2010. Continuation of the creation of an analysis system of Spanish Research. TIN2007-30380-E. TIN2008-01202-E/TIN. PI: Josep L. Larriba-Pey. 310K euros.
2007-08. IBM has been sponsoring our research for the last ten years. The total amount of sponsoring exceeds 200K euros and has allowed us to complete 2 PhD theses during this project.
2006-2007. Agreement with the notaries' certification agency ANCERT, a company of the notaries of Spain. The goal of this agreement is the deployment of a service based on DEX for the detection of fraud in real estate transactions and in the creation of companies. PI: Josep Lluis Larriba Pey. 260K euros.
2007. Agreement with the Council of Health and Consumer Affairs of the Balearic Islands Government for the development of the cancer registry of the islands based on DAURUM. PI: Josep Lluis Larriba Pey. 29K euros.
3 Results obtained
The results obtained by the FBM group are described in the following items:
We believe that the objectives set for the project are being accomplished successfully, considering that we have more than 20 publications. At the technological level we have built many software prototypes, three of which are being publicly shown or transferred to industry.
We have published 6 journal papers, 10 international conference papers and several other publications. Some of our results have been presented in top conferences with acceptance rates of 15% or less, like WWW and ACM WSDM, which shows their quality and novelty. We also received a best paper award in a workshop.
Our research is targeted at analyzing and visualizing real data coming from the Web; hence it is directly applicable to many important industrial problems.
Two PhD theses, the product of collaborative research with other institutions, were completed last year, and we have 4 PhD students in their last year, one finishing in the first semester of 2009. This clearly shows our capacity for training new researchers.
We are collaborating with many groups. Among them we can mention the Univ. of Magdeburg in Germany (1 paper), the Univ. of California at Davis, USA (1 paper), the Federal Univ. of Minas Gerais in Brazil (3 papers), the Univ. of Chile (1 paper), the Univ. of Valparaíso in Chile (1 paper) and the Univ. of Lujan in Argentina (1 paper).
The management of the project has been focused on the research strategy and the efficient use of resources. This has had a positive impact on the quality and quantity of the scientific and technological output.
The results obtained by the DAMA-UPC group are as follows:
We believe that the objectives set for the project are being accomplished at this moment: at the research level, with 15 papers in international conferences and journals and one PhD and one MSc thesis defended in the last two years; at the technology generation level, with four pieces of software being generated and maintained by the group; and at the technology transfer level, with our collaboration in different research and technology transfer projects with industry and the administration.
The relevance of our research can be assessed by the amount of technology transfer that we are doing based on it. More than half of the budget of DAMA-UPC comes from technology transfer stemming from our own research.
As we said above, our scientific production and technological production prove our involvement with research and with the projection of this research for the benefit of our society.
We are very capable of training researchers. So far the group has produced 4 PhD theses, and one more is on its way.
The work done in cooperation with other research groups can be summarized as follows: with the University of Cyprus (2 papers), with the Research Council of Spain, CSIC (6 papers), with IBM (4 papers) and with FBM (2 papers).
The management of the project has increased the amount of research being done because it has allowed the researchers to make better use of their time.
In addition, the coordination of the project is allowing us to steadily integrate the work of the two groups. We believe that this will eventually allow us to produce a significant amount of joint research in the next few years, as we have already published two papers together and we are currently working on a couple of joint publications.
4 References
[AgDKE08] J. Aguilar-Saborit, P. Trancoso, V. Muntés-Mulero, J.L. Larriba-Pey. Dynamic adaptive data structures for monitoring data streams. Vol 66, Issue 1, pp. 92-115, Data & Knowledge Engineering, Elsevier, July 2008.
[AgDKE07] J. Aguilar-Saborit, V. Muntés-Mulero, C. Zuzarte, J.-L. Larriba-Pey. Star join revisited: performance internals for cluster architectures. Vol 63, Issue 3, pp. 997-1015, Data & Knowledge Engineering, Elsevier, Dec 2007.
[AgCIKM08] J. Aguilar-Saborit, M. Jalali, D. Sharpe, V. Muntés-Mulero. Exploiting pipeline interruptions for efficient memory allocation. In CIKM 2008, pp. 639-648. ACM, Napa Valley, USA, 2008.
[AlSIGIR07] Omar Alonso, Michael Gertz, Ricardo Baeza-Yates. Search results using timeline visualizations (Demo). SIGIR 2007, p. 908, July 2007.
[AlSIGCHI07] Omar Alonso, Ricardo Baeza-Yates, Michael Gertz. Exploratory Search Using Timelines (Poster). Designing and Evaluating Interfaces to Support Exploratory Search Interaction, SIGCHI 2007 Workshop, San Jose, USA, April 2007.
[AlForum07] Omar Alonso, Michael Gertz and Ricardo Baeza-Yates. On the Value of Temporal Information in Information Retrieval. ACM SIGIR Forum 41(2), 35-41, December 2007.
[AlTHESIS08] Omar Alonso. Temporal Information Retrieval. PhD Thesis. Advisors: Michael Gertz and Ricardo Baeza-Yates. Univ. of California at Davis, USA. August 2008.
[BYSPIRE06] Ricardo Baeza-Yates, Liliana Calderón, Cristina González. The Intention Behind Web Queries. SPIRE 2006, LNCS Springer, Glasgow, Scotland, October 2006.
[BYWebKDD07] Ricardo Baeza-Yates, Alvaro Pereira, Nivio Ziviani. Understanding Content Reuse on the Web: Static and Dynamic Analyses. In Advances in Web Mining and Web Usage Analysis. Springer LNCS 4811, O. Nasraoui, M. Spiliopoulou, J. Srivastava, B. Mobasher, B. Masand (Eds), 227-236, 2007.
[BYISE07] Ricardo Baeza-Yates, Carlos Castillo and Vicente López. Pagerank Increase under Different Collusion Topologies. In Internet Search Engines, Ravi Kumar Jain B, editor, Icfai University Press, India, 163-183, 2007.
[BYJASIST07] Ricardo Baeza-Yates, Carlos Hurtado, Marcelo Mendoza. Improving Search Engines by Query Clustering. Journal of the American Society of Information Science and Technology - JASIST 58 (12), 1793-1804, 2007.
[BYTOIT07] Ricardo Baeza-Yates, Carlos Castillo, Efthimis Efthimiadis. Characterization of National Web Domains. ACM Transactions on Internet Technology 7, Issue 2, 2007.
[BYJWE07] Ricardo Baeza-Yates, Carlos Castillo. Crawling the Infinite Web: Five Levels are Enough. Journal of Web Engineering 6, 49-72, 2007.
[BYWWW08] Ricardo Baeza-Yates, Alvaro R. Pereira Jr., Nivio Ziviani. Genealogical trees on the Web: a search engine user perspective. In WWW 2008, Beijing, China, 367-376, 2008.
[BIBEX] BIBEX on-line bibliographic exploration. www.dama.upc.edu/bibex
[DAURUM] DAURUM on-line documentation. http://www.dama.upc.edu/daurum_an.php?language=english
[DEX] DEX Java library. www.dama.upc.edu/dex
[DoCIKM08] D. Dominguez-Sal, M. Surdeanu, J. Aguilar-Saborit, J. Larriba-Pey. Cache-aware load balancing for question answering. In CIKM 2008, pp. 1271-1280. Napa Valley, USA, ACM, 2008.
[DoEur07] D. Dominguez-Sal, J. L. Larriba-Pey and M. Surdeanu. Multi-layer Collaborative Cache for Question Answering. In Proc of Europar 2007. Springer LNCS 4641, pp. 287-298. Rennes, France, 2007.
[GoEDBT08] S. Gómez-Villamor, G. Soldevila-Miranda, A. Giménez-Vañó, N. Martínez-Bazan, V. Muntés-Mulero, J.-L. Larriba-Pey. BIBEX: a bibliographic exploration tool based on the DEX graph query engine. In EDBT 2008, pp. 735-739. ACM, Nantes, France, 2008.
[GoLAWeb08] Cristina González-Caro, Marcos Mari-Carmen, Liliana Calderon, Ricardo Baeza-Yates. Human or Automatic Answers? A User's Based Study (short paper). Latin American Web Conference, Vila Velha, Brazil, Oct 2008.
[GrLAWeb08] Eduardo Graells, Ricardo Baeza-Yates. Evolution of the Chilean Web: A Larger Study (short paper). Latin American Web Conference, Vila Velha, Brazil, Oct 2008.
[GuPriv08] J. Guisado-Gámez, A. Prat-Pérez, J. Nin, V. Muntés-Mulero, J. Larriba-Pey. Parallelizing Record Linkage for Disclosure Risk Assessment. Privacy in Statistical Databases, pp. 190-202. Springer LNCS 5262, 2008.
[MaCIKM07] N. Martinez-Bazán, J. Nin, S. Gómez-Villamor, M.-A. Sánchez-Martínez, V. Muntés-Mulero, J.-L. Larriba-Pey. DEX: High-Performance Exploration on Large Graphs for Information Retrieval. In CIKM'07, pp. 573-582, ACM, Lisbon, Portugal, 2007.
[MaPAT08] N. Martínez-Bazán, S. Gómez-Villamor, V. Muntés-Mulero, J. L. Larriba-Pey. Procedure to represent and manipulate multi-graphs based in bit maps. Registry date 22/7/2008. Registry number P200802251. Evaluation pending.
[MaMSCT08] N. Martínez-Bazán. DEX: High Performance Graph Databases. MSc thesis. Polytechnic University of Catalonia. September 2008.
[MeLAWeb08] Marcelo Mendoza, Ricardo Baeza-Yates. A Web Search Analysis considering the Intention behind Queries. Latin American Web Conference, Vila Velha, Brazil, Oct 2008.
[MuDAS07] V. Muntés-Mulero, J. Aguilar-Saborit and J.-Ll. Larriba-Pey. Improving Quality and Convergence of Genetic Query Optimizers. In DASFAA 2007 proceedings. LNCS 4443, pp. 6-17, June 2007.
[MuTHESIS07] V. Muntés-Mulero. Large Join Query Optimization. PhD thesis. Polytechnic University of Catalonia. May 2007.
[NeWebKDD07] David Nettleton, Liliana Calderón, Ricardo Baeza-Yates. Analysis of Web Search Engine Query Session and Clicked Documents. In Advances in Web Mining and Web Usage Analysis. Springer LNCS 4811, O. Nasraoui, M. Spiliopoulou, J. Srivastava, B. Mobasher, B. Masand (Eds), 207-226, 2007.
[NeIJIS08] David Nettleton, Ricardo Baeza-Yates. Web Retrieval: Techniques for the Aggregation and Selection of Queries and Answers. International Journal of Intelligent Systems 23(12), 1223-1234, 2008.
[NiIPC07] J. Nin, J. Pont-Tuset, P. Medrano-Gracia, J.L. Larriba-Pey, V. Muntés-Mulero. Increasing Polynomial Regression Complexity for data Anonymization. In IPC'07, pp. 29-34, IEEE, 2007.
[NiIDEAS07] J. Nin, V. Muntés-Mulero, N. Martínez-Bazán and J.-Ll. Larriba-Pey. On the Use of Semantic Blocking Techniques for Data Cleansing and Integration. In IDEAS'07, pp. 190-198. IEEE, 2007.
[PACHA] PACHA: Temporal retrieval systems with visualization metaphors. http://wwwcsif.cs.ucdavis.edu/~alonsoom/time.html
[PaDEXA08] M. Papas, J. Larriba-Pey, P. Trancoso. Categorized Sliding Window in Streaming Data Management Systems. In DEXA 2008, pp. 625-634. Springer LNCS 5181, 2008.
[PaIV07] V. Pascual, J. C. Dürsteler. WET: A prototype of an exploratory search system for web mining to assess usability. In IV '07: 11th International Conference in Information Visualization, pp. 211-215, 2007.
[PeLAWeb06] Alvaro Pereira Jr., Ricardo Baeza-Yates, Nivio Ziviani. When and where Web duplicates occur. In IV Latin American Web Congress, IEEE CS Press, Puebla, México, 127-134, October 2006.
[PeTHESIS08] Alvaro Pereira Jr. A Model for Fast Web Mining Prototyping: Design, Algebra, Implementation and Use Cases. PhD Thesis. Advisors: Ricardo Baeza-Yates and Nivio Ziviani. Federal Univ. of Minas Gerais, Brazil, October 2008.
[PeWSDM09] Alvaro Pereira, Ricardo Baeza-Yates, Nivio Ziviani, Jesus Bisbal. A Model for Fast Web Mining Prototyping. In ACM WSDM 2009, Barcelona, February 2009.
[PoPinKDD07] Barbara Poblete, Myra Spiliopoulou and Ricardo Baeza-Yates. Website Privacy Preservation for Query Log Publishing. In First ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD (PinKDD'07), F. Bonchi, E. Ferrari, B. Malin and Y. Saygin (Eds.), Springer LNCS 4890, San Jose, California, USA, 80-96, 2007. Best paper award.
[PoWWW08] Barbara Poblete and Ricardo Baeza-Yates. Query-sets: using implicit feedback and query patterns to organize Web documents. In WWW 2008, Beijing, China, 41-50, 2008.
[PoMDAI08] J. Pont-Tuset, J. Nin, P. Medrano-Gracia, J. Larriba-Pey, V. Muntés-Mulero. Improving Micro-aggregation for Complex Record Anonymization. MDAI 2008, pp. 215-226. Springer LNCS 5285, 2008.
[PoSOF08] J. Pont-Tuset, P. Medrano-Gracia, J. Nin, J.-L. Larriba-Pey, V. Muntés-Mulero. On the Use of Neural Networks for Data Privacy. In SOFSEM 2008. Springer LNCS 4910, pp. 634-645, 2008.
[ToCyb07] Gabriel Tolosa, Fernando Bordignon, Ricardo Baeza-Yates, Carlos Castillo. Characterization of the Argentinian Web. Cybermetrics 11(1), July 2007.
[WET] WET: Web Exploration Tool. http://www.infovis.net/printMag.php?num=193&lang=1