Using Association Rules for Monitoring Meta-Data Quality in Web Portals
Marcos Aurélio Domingues¹, Alípio Mário Jorge², Carlos Soares²
¹ LIACC-NIAAD – Universidade do Porto, Rua de Ceuta, 118, Andar 6 – 4050-190 Porto, Portugal
² LIACC-NIAAD – Faculdade de Economia – Universidade do Porto, Rua de Ceuta, 118, Andar 6 – 4050-190 Porto, Portugal
Abstract. We present a system to monitor the quality of the meta-data used to describe content in web portals. It implements meta-data analysis using statistics, visualization and association rules. The system enables the site’s editor to detect and correct problems in the description of contents, thus improving the quality of the web portal and the satisfaction of its users. We have developed this system and tested it on a Portuguese portal for business executives.
1. Introduction
The aim of many web portals is to select, organize and distribute content (information or other services) in order to satisfy their users/customers. The methods to support this process are to a large extent based on meta-data (such as keywords, category and other descriptors) that represent and describe content. For instance, the search engines included in portals often take into account the keywords associated with content to compute its relevance to a query. Likewise, the accessibility of content by navigation depends on its position in the structure of the portal, which is usually defined by a descriptor (e.g., category).
Meta-data is usually filled in by the authors who insert content in the portal. The publishing process, which goes from the insertion of content to its actual publication on the portal, is regulated by a workflow. The complexity of this workflow varies: the author may be authorized to publish content directly; alternatively, content may have to be analyzed by one or more editors, who decide whether to authorize its publication. Editors may also suggest changes to the content and its meta-data, or make the changes themselves.
When there are many different authors or the publishing process is less strict, the meta-data may describe content in a way that is not suitable for the purpose of the portal, thus degrading the quality of the services provided. For instance, a user may fail to find relevant content using the search engine if the set of keywords assigned to it is inappropriate. Thus, it is essential to monitor the quality of the meta-data describing content to ensure that the collection of content is made available to the users in a structured, inter-related and easily accessible way.
In this paper we present a system to monitor and control the quality of the meta-data used to describe content in web portals. We assume that the portal is based on a Content Management System, which stores and organizes content using a database. The core of our system is a set of metrics and visualizations that objectively assess the quality of the meta-data and, indirectly, the quality of the service provided by the portal.
2. Methodology to monitor the quality of content meta-data
In this section we summarize the methodology to monitor the quality of meta-data in web portals that was proposed by [Soares et al. 2005].
Traditionally, the publishing process involves authors and editors (Figure 1). Authors insert content in the web portal, control its access permissions, define the values of the meta-data describing it, and integrate it into the organization of the portal (for instance, by locating it in a hierarchy of categories). Editors monitor the publishing process and authorize the publication of content. This process is very important to the success of a web site. For instance, an inadequate classification of content may make it practically invisible to users who are interested in it. Content is made available on the portal and the user accesses are recorded in the web access logs.
[Soares et al. 2005] added two new elements (dashed lines) to the process. An Evaluation Tool, called EdMate, computes a set of metrics and visualizations that assess the quality of the content meta-data and generates dynamic web reports. Based on those reports, the Super-Editor may identify necessary corrections or opportunities for improving the meta-data. This can be done directly or by making suggestions to authors.
Figure 1. Publishing process proposed by [Soares et al. 2005].
3. Architecture of the EdMate system
The architecture of the EdMate system is presented in Figure 2. The Assessment module periodically computes the values of all the metrics and statistical indicators, and generates the corresponding visualizations. The values and the plots are then stored in the Metrics database. The values and visualizations in this database are used by the Presentation module to generate the Hyper reports, which are accessed using a web browser.
Figure 2. Architecture of the EdMate system.
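The flow between the three components can be illustrated with a minimal Python sketch. It is not the EdMate implementation: the function names (`assess`, `store`, `present`), the `metrics` table layout, the two indicators and the sample content are all assumptions made for illustration, and a plain-text summary stands in for the Hyper reports.

```python
import sqlite3
from statistics import mean

def assess(contents):
    """Assessment step: compute (metric name, descriptor, value) rows."""
    lengths = [len(kw) for item in contents for kw in item["keywords"]]
    counts = [len(item["keywords"]) for item in contents]
    return [
        ("mean_value_length", "keywords", mean(lengths)),
        ("mean_values_per_item", "keywords", mean(counts)),
    ]

def store(conn, rows):
    """Persist the computed values in a (hypothetical) metrics table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (name TEXT, descriptor TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
    conn.commit()

def present(conn):
    """Presentation step: read the stored values and render a text report."""
    rows = conn.execute(
        "SELECT name, descriptor, value FROM metrics ORDER BY name"
    ).fetchall()
    return "\n".join(f"{n} [{d}]: {v:.2f}" for n, d, v in rows)

# Hypothetical content items; a real portal would read them from its CMS database.
contents = [
    {"id": 1, "keywords": ["fiscality", "international taxation"]},
    {"id": 2, "keywords": ["fiscality"]},
]
conn = sqlite3.connect(":memory:")
store(conn, assess(contents))
report = present(conn)
print(report)
```

Keeping the computed values in a database between the two steps means that reports can be regenerated without recomputing the metrics, which matches the periodic assessment described above.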
Based on data quality principles [Pipino et al. 2002], we have designed more than 60 metrics. Table 1 presents a few examples for illustration purposes. The values of the metrics and the different types of visualization (see example in Figure 3) are represented and summarized in web reports to be used by the super-editor and the editors/authors. The reports allow the identification of deviations and trends that may justify editorial actions. Note that the functions used to compute the metrics and the visualizations range from very simple statistics to more complex methods, such as association rules [Agrawal and Srikant 1994]. Additionally, note that the metrics can be based on other kinds of data (e.g., web access logs), besides meta-data.
Table 1. Name and description of a few metrics.
Name: Length of value
Description: Number of characters in the value of a descriptor. Extremely large or small values may indicate inadequate choice of values to represent the content.
Name: Association between values
Description: The confidence level of an association rule A → B is an indicator of whether the set of values A makes the set of values B redundant. The higher the value, the more redundant B is expected to be.
Name: Frequency in search
Description: Frequency of the values of a descriptor in the web access logs (e.g., the frequency of a search using a given keyword).
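As an illustration of the two simpler metrics in Table 1, the following Python sketch computes the length of each descriptor value and the frequency of descriptor values among search terms. The z-score threshold, the function names and the sample data are assumptions made for this example; the paper does not state how extreme lengths are identified.

```python
from collections import Counter
from statistics import mean, pstdev

def length_outliers(values, z=2.0):
    """Return values whose character length lies more than z standard
    deviations from the mean length (an assumed outlier criterion)."""
    lengths = {v: len(v) for v in set(values)}
    mu = mean(lengths.values())
    sigma = pstdev(lengths.values())
    if sigma == 0:
        return []
    return sorted(v for v, n in lengths.items() if abs(n - mu) / sigma > z)

def search_frequency(queries, descriptor_values):
    """Count how often each descriptor value occurs as a search term."""
    known = set(descriptor_values)
    return Counter(q for q in queries if q in known)

# Hypothetical keyword values and search terms extracted from access logs.
keywords = ["tax", "law", "vat", "audit", "bank", "fund",
            "a very long keyword that is really a sentence"]
queries = ["tax", "tax", "vat", "unknown"]

outliers = length_outliers(keywords)
freq = search_frequency(queries, keywords)
print(outliers)
print(freq.most_common())
```

A keyword that is never searched for, or one that reads like a sentence, is a candidate for editorial review rather than an error in itself.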
4. Case study
In this section we describe the application of the EdMate system to PortalExecutivo.com (PE), a Portuguese web portal targeted at business executives. The business model is subscription-based, which means that only paying users have full access to content through web login. However, some content is freely available and users can freely browse the site's structure. Content is provided not only by PE but also by a large number of partners. The goal of PE is to facilitate the access of its members to relevant content. Value is added to the contributed content by structuring and interrelating it. This is achieved by filling in a rich set of meta-data fields, including keywords, category and source, among others. Therefore, meta-data quality is an essential issue for PE.
Since the results of queries to the search engine are affected by the quality of the keywords used to describe content, here we focus on this meta-data field. The meta-data used covers the period from April to September 2004. During this period, 124,287 hits were recorded in the logs, concerning 17,196 different content items.
The Assessment module also generates association rules. Each transaction is the set of keywords used in a content item. The rules are used to define the association between values metric, which indicates sets of keywords that are used together. This metric can be graphically displayed as shown in Figure 3. Dots represent keywords and arrows represent significant associations. We observe that often a general keyword (e.g., fiscality - fiscalidade) is associated with a more specific one (e.g., international taxation - tributação internacional). This implicit structure of the keywords, unveiled by the discovered association rules, enables the detection of incorrect descriptions. An example is a content item that has a more specific keyword (e.g., international taxation) but lacks the more general keyword usually associated with it (e.g., fiscality). Besides this, the association rules enable the editors to understand the meta-data editing behaviour of the authors. Other kinds of analyses are presented in [Soares et al. 2005].
Figure 3. Relationships between keywords obtained using association rules.
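The use of keyword associations to spot incomplete descriptions can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it only considers rules between single keywords and counts pairs by brute force instead of using the Apriori algorithm of [Agrawal and Srikant 1994]; the thresholds, function names and sample content are hypothetical. Confidence is computed in the standard way, as the number of transactions containing both keywords divided by the number containing the antecedent.

```python
from collections import Counter
from itertools import permutations

def pair_rules(transactions, min_support=2, min_confidence=0.7):
    """Return {(a, b): confidence} for single-keyword rules a -> b."""
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        # Ordered pairs, so that a -> b and b -> a are counted separately.
        pair_count.update(permutations(sorted(items), 2))
    rules = {}
    for (a, b), n_ab in pair_count.items():
        conf = n_ab / item_count[a]
        if n_ab >= min_support and conf >= min_confidence:
            rules[(a, b)] = conf
    return rules

def missing_consequent(contents, rules):
    """List (content id, keyword present, keyword expected) for each content
    item that contains the antecedent of a rule but not its consequent."""
    return sorted(
        (cid, a, b)
        for cid, kws in contents.items()
        for (a, b) in rules
        if a in kws and b not in kws
    )

# Hypothetical content items, each described by a set of keywords.
contents = {
    1: {"fiscality", "international taxation"},
    2: {"fiscality", "international taxation"},
    3: {"fiscality", "international taxation"},
    4: {"fiscality"},
    5: {"fiscality"},
    6: {"international taxation"},  # specific keyword without the general one
}
rules = pair_rules(contents.values())
flagged = missing_consequent(contents, rules)
print(rules)
print(flagged)
```

In this toy data only the rule from the specific to the general keyword passes the confidence threshold, so the content item carrying the specific keyword alone is the one reported to the editor.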
5. Conclusions and future work
Many web portals have a distributed model for the contribution of content. No matter how strict the publishing process is, low quality meta-data will sometimes be used to describe content. This degrades the quality of the services provided to the users.
In this paper we present a system to monitor the quality of meta-data used to describe content in web portals. We have used the system on data from an existing web portal. This preliminary application allowed the detection and correction of poor editorial practices. Besides enabling the assessment of the quality of the meta-data, it enables the monitoring of corrective actions and it provides an up-to-date perspective of the publishing process.
As future work, we plan to make this system an independent module for web site monitoring, analysis and adaptation. We plan to generalize the database schema, and to develop other kinds of analysis (besides content meta-data) to address other problems, such as intrusion detection, planning of marketing strategies and content recommendation.
Acknowledgements
Fundação para Ciência e Tecnologia (SFRH/BD/22516/2005), PortalExecutivo.com, POSC/EIA/58367/2004/Site-o-Matic Project and FEDER.
References
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Bocca, J. B., Jarke, M., and Zaniolo, C., editors, Proceedings 20th International Conference on Very Large Data Bases, VLDB, pages 487–499.
Pipino, L. L., Lee, Y. W., and Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4).
Soares, C., Jorge, A. M., and Domingues, M. A. (2005). Monitoring the quality of meta-data in web portals using statistics, visualization and data mining. In Proceedings of EPIA 2005, Lecture Notes in Computer Science, Vol. 3808, pages 371–382.