Using a taxonomy-based fingerprint: classification and recognition of the academic webspace

David REYMOND and Nathalie PINEDE
MICA-GRESIC, Université Bordeaux 3 - Michel de Montaigne, 33607 Pessac
[email protected] [email protected]
Abstract. Providing indicators for websites is challenging¹. Traditional webometric studies use case studies of academic web spaces to develop theories and powerful tools. The academic web space is huge and diverse: departments, laboratories, scholars and many other sub-components may publish websites. Hence, in order to deliver good results (link studies between educational departments, publication statistics in resource portals), in-depth qualitative preparation is necessary to identify subclasses of websites. We propose a way to produce a 'fingerprint' for homepages and use it to classify or retrieve subsites, distinguishing between various kinds of sites and allowing detailed webometric studies. Our work leans on a taxonomy built recursively from an initial set of websites, which allows the automatic categorization of homepages via their projection onto this informational space. The taxonomy is organized at three primary levels: activities, profiles (the targets of websites), and tools and web-specific contents. We describe the prototype and give results for the academic web space of French universities. As our prototype is specific to university websites, we also discuss relevant perspectives and future work.
1 Introduction
The academic web in France is very complex: although often (but only partially [1]) organized by domain-name endings (.u-town.fr or .univ-town.fr), websites may originate from educational departments, laboratories, library portals or other components of the organization. This offers a useful case study for our experiment, as this knowledge is used to customize our approach. Each website differs from the others in its targets and information-delivery objectives, and the sites are too numerous² to be analyzed manually. Faced with such a quantity, decision-makers need indicators to pilot the web communication process, and building such indicators automatically turns out to be genuinely difficult. Our approach differs from classical classification methods. We use a taxonomy [5] that organizes the information presented by these sites into three categories: the activities of the organization, the web-based tools, and the targets of the sites (profiles). In a second step, the taxonomy, combined with a spider that extracts specific features of a page, is used to compute a fingerprint. In a third step, the centroids of a manually generated classification are used to build a reference map: every subsite of the editorial zone is compared to the centroids and labelled with the class of the nearest one.

¹ Acknowledgments: this interdisciplinary research has been supported by the RAUDIN project, Feder N° 31462, the Conseil Régional Aquitaine and the University of Bordeaux 3.
² We counted more than four hundred websites in 2010 for the four universities of Bordeaux across the DNS zones u-bordeaux{1–4}.
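The labelling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the centroid vectors, class names and function names are hypothetical placeholders, not values or code from the study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(vector, centroids):
    """Label a site fingerprint with the class of the closest centroid."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))

# Illustrative centroids over the three taxonomy dimensions
# (activities, profiles, tools); not real data from the study.
centroids = {
    "research":    [0.9, 0.3, 0.1],
    "educational": [0.4, 0.8, 0.2],
    "library":     [0.2, 0.3, 0.9],
}

# A homepage fingerprint close to the "research" centroid gets that label.
label = nearest_centroid([0.85, 0.35, 0.15], centroids)
```

Any metric could be substituted for the Euclidean distance here; the principle of labelling with the minimal-distance centroid stays the same.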
2 Fingerprints and automatic classification overview
A website can be represented by a set of feature vectors using term frequency [3]. The authors use a centroid-based classification approach that gives satisfactory results. This approach was improved by selecting specific features of web pages, such as the title and metatags [6], in addition to the body text, because this feature is lacking in many webpages (only 20 % in our corpus) [4]. We follow this representation of a website, consider only the home page, and experiment more deeply with feature selection. Hence we distinguish, for each page: specific metadata used for the document description (title, description, keywords), anchor text external to the page, and the body text. The classification process for the selected features uses the taxonomy. In our model, each homepage is represented by a vector of occurrences of the words included in the taxonomy, for each of its dimensions, as shown in figure 1. In this representation, the informational features of the page, namely anchor or linked terms (LT), selected metadata (MT: title, description, keywords) and the textual content (TC) of the page (including ALT data), are extracted from the homepage using a parser. The occurrences of the words representing the three taxonomy classes are then calculated for LT, and only on structural hypernyms for the MT and TC elements. The occurrences for each dimension are then weighted using three variables α, β and γ to obtain the final representation of the homepage. The weights α, β and γ are fixed at 1 hereafter and will not be discussed here. The vector is then normalized so that metrics can be applied, as commonly suggested in text classification using the vector representation [2] but also in web page classification [3, 6].
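The fingerprint computation can be sketched as follows. This is a minimal sketch under stated assumptions: the taxonomy vocabulary, the term lists and all names below are hypothetical illustrations, not the taxonomy or code actually used by the prototype.

```python
import math

# Hypothetical taxonomy: three dimensions, each reduced here to a small term set.
TAXONOMY = {
    "activities": {"research", "teaching", "publication"},
    "profiles":   {"student", "scholar", "staff"},
    "tools":      {"portal", "catalogue", "webmail"},
}

# Weights for the LT, MT and TC features (fixed at 1 in the paper).
ALPHA, BETA, GAMMA = 1.0, 1.0, 1.0

def count_hits(terms, words):
    """Occurrences of taxonomy terms in a bag of words."""
    return sum(1 for w in words if w in terms)

def fingerprint(lt_words, mt_words, tc_words):
    """Weighted, normalized occurrence vector over the taxonomy dimensions."""
    vec = []
    for terms in TAXONOMY.values():
        score = (ALPHA * count_hits(terms, lt_words)
                 + BETA * count_hits(terms, mt_words)
                 + GAMMA * count_hits(terms, tc_words))
        vec.append(score)
    norm = math.sqrt(sum(x * x for x in vec))          # Euclidean normalization
    return [x / norm for x in vec] if norm else vec

# Example: a homepage whose anchors, metadata and body mostly mention research.
fp = fingerprint(["research"], ["student", "research"], ["portal", "teaching"])
```

The normalization makes fingerprints of pages with very different amounts of text comparable, which is what allows a single distance metric to be applied afterwards.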
3 Results
We try here to measure the efficiency of website classification using the representation of the homepages and the reference centroids presented above. We present two experiments: the first stresses the approach using a reference qualitative panel; the second assesses generic classification by trying to 'discover' academic websites from well-known indexes. Table 1 shows the efficiency of the
Fig. 1. Overall process of the HomePage representation using a projection on the informational taxonomy dimensions
                                           Bx 1                Bx 2
Initial number of websites                 98                  83
Classification success on all classes      78 % (71 websites)  74 % (67 websites)
Classification success on classes
Research, Educational                      86 % (43 websites)  83.3 % (41 websites)

Table 1. Classification results using centroids.
process, but also that there might be some mistakes: the websites of the University of Bordeaux were taken as the primary input for the computation of the centroids, so we should obtain 100 % success on this class. In fact, the results produced by the system showed us that we had made some mistakes during the manual classification. A more interesting fact is that the initial classes were not fully adequate for the classification. We approached the organization thinking in terms of structure (university) and sub-structures (laboratories, educational departments or substructures, libraries, associations, registrar structures, etc.). In fact, when the homepage representation deals only with the structural navigation part of the taxonomy, it appears that sites can mix very different activities, which is revealed by profiling the websites on those dimensions. We next propose to refine the efficiency of the classification process and to apply the system to retrieve from the web other websites that are not under the DNS zone but are edited by components or sub-components of the organization. We will experiment with this by extracting from web indexes the URLs relevant to specific keywords (such as 'laboratoire') and use the prototype to filter out the noise, which is quite substantial in such result sets.
References

1. Isidro Aguillo. Web, webometrics and the ranking of universities. In Proceedings of the 3rd European Network of Indicators Designers Conference on STI Indicators for Policymaking and Strategic Decision, CNAM, Paris, March 2010. To appear.
2. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142, 1998.
3. Hans-Peter Kriegel and Matthias Schubert. Classification of websites as sets of feature vectors. In M. H. Hamza, editor, Databases and Applications, pages 127–132. IASTED/ACTA Press, 2004.
4. J. M. Pierre. On the Automated Classification of Web Sites. Linköping University Electronic Press, Sweden, 2001.
5. Nathalie Pinède and David Reymond. De la diversité au lissage informationnel : création d'une taxonomie inductive pour les sites web universitaires. In 17e congrès de la SFSIC : au cœur et à la lisière des SIC, Dijon, 23–26 juin 2010. To appear.
6. Nguyen Viet Thanh and Nguyen Van Khiet. Feature selection for a centroid-based websites classification approach. In 3ème Conférence Internationale en Informatique - Recherche, Innovation & Vision du Futur (RIVF2005), Université de Can Tho (Vietnam), 2005.