Extracting and Exploring Web Aggregates with Experimental Visualisation Tools

Franck Ghitalla, Eustache Diemert, Guilhem Fouetillou, Fabien Pfaender
Université de Technologie de Compiègne (UTC)
e-mail: {franck.ghitalla, eustache.diemert, guilhem.fouetillou, fabien.pfaender}@utc.fr
Mail: Franck Ghitalla - Dep. T.S.H. - UTC / Centre de Recherche P. Guillaumat / BP 60.319 - 60206 Compiègne CEDEX, France

Introduction

How can we explore and describe the geography of an open, dynamic and large-scale hypertext system such as the Web? We address this issue by developing an experimental tool for extracting and analyzing aggregates of Web documents. In this tool, visualization solutions play a central role at each step of the process, providing the user with powerful insights into the data collected from the Web. Over the last decade, what is now called "Web Mining" has emerged as a standalone empirical, if not academic, field, borrowing recent advances in graph theory [1], data mining and traditional IR techniques (see [4] for a review). Merging this heritage with new ideas, the field has produced methods and heuristics capable of correlating content analysis with link topology. Concurrently, the infoviz (Information Visualization) field has emerged from a broad substratum of traditions, including design, scientific visualization and the work of Bertin [2] and Tufte [18], and has been transformed into a quite new scientific branch, producing fertile point solutions for coping with large, complex datasets (see the work of Shneiderman [17] and Munzner [16], for instance). It is rather rare, however, that these two branches merge their interests. This paper describes such a marriage, focusing on the use of infoviz tools to monitor a process aimed at extracting and analyzing Aggregates of Web pages. The steps of the process to be managed are: first, Web crawling, including connectivity and content-data indexing; then, applying filtering and clustering methods; and finally, analyzing the results.
Visualization tools permit us to represent large data sets, which are heterogeneous and poorly structured by nature, as is usual on the Web, comprehensively and synthetically. For the Web Mining field in particular, we believe that such a graphical synthesis may enable dynamic network exploration.

Topic: Intelligent Human-Web Interaction
Keywords: Visualization of Information and Knowledge, Multimedia Representation, Visualization of Complex Data, Link Topology and Sites Hierarchies, Innovative Technologies.
1. The TARENTe project

1.1 Exploring and analyzing Web Aggregates with an experimental tool

The TARENTe system was designed to provide multiple services, including Web crawling, network analysis, data mining and information visualization. We aim at evaluating some algorithmic solutions for measuring degrees of correlation between content analysis and link structure. Our methodology was therefore built around the assumption in [9,10,11,12] that broad topics are organized in Aggregates of Web pages in which there is a strong connection between link density and topical content. In particular, TARENTe was designed to detect Aggregate structures by producing a core (a strongly connected component) and by filtering the graph to highlight a border phenomenon in the data provided by the crawling and indexing phase. These structures are supposed to show a strong correlation between content and linking, especially in the core. The experimental processing provided by TARENTe could validate, or make clearer, some assumptions on the Aggregate topology of the Web [6,10,11], such as: modeling the Web as a scale-free network [2] in which the connectivity follows a power law [1]; taking into account the preferential-attachment process to model the graph's growth; and verifying the small-world phenomenon. Most important of all is Kleinberg's assumption that broad topics on the Web are structured as Aggregates organized around a core of Hubs and Authorities (a bipartite graph) mutually reinforcing each other [11]. The scientific aspect of our work aims to test these principles by describing the Aggregates at the local level: their "volume" (or the scale of their resources), their inner structures, and their borders (Aggregate neighborhoods).
We have explored some of them (topics in the French domain: online literature, asylum seekers, Scientology, cognitive sciences) by connecting statistical analysis and visualization modules so as to produce visual as well as numerical evidence (see [8] for more details on methodology and results).
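As a purely illustrative sketch of the mutual-reinforcement principle behind Hubs and Authorities (Kleinberg's HITS scheme, not TARENTe's actual implementation; the graph format and iteration count here are our own assumptions), the score computation can be written as:

```python
# Minimal HITS-style hub/authority iteration (illustrative sketch only).
# graph: dict mapping each node to the list of nodes it links to.

def hits(graph, iterations=50):
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to the node.
        auth = {n: sum(hub[u] for u in graph if n in graph.get(u, ()))
                for n in nodes}
        # Hub score: sum of the authority scores of pages the node points to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize so scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

On a toy graph where two hub pages both cite one authority and only one cites a second, the shared target ends up with the higher authority score, which is exactly the reinforcement effect the Aggregate core hypothesis relies on.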
1.2 The role of visualization tools

Unlike industrial information-retrieval systems, our tool is not designed to cope with queries in real time. Basically, its aim is to give experimental hints and
visualizations so as to help the user apprehend the organization and lexicon of a topic at different moments of the process. It could also help in conceiving end-user GUI solutions for a non-expert tool. Visualizations are used both to help the user take decisions about the processes and to help formulate a typology of the topics. In particular, the modular structure of TARENTe permits the user to select and change descriptors so as to produce evidence of the existence of Web Aggregates (or to produce insights into them). Monitoring via visualization is therefore a form of advanced analysis which modifies and focuses the action process. We will see that several dimensions are at stake when TARENTe processes these complex data systems:
– graph exploration, involving visualization of patterns, visual focus on some interesting nodes, positioning of Aggregate neighborhoods, and immersing navigation, including dynamic exploration and concurrent content access (section 3);
– the design of an original end-user visualization which synthesizes the essential aspects of the analysis process; this is the purpose of the "geographical" interface (presented in section 4).
2. Extracting Aggregates: topical content and link topology

2.1 Methodology

Our experiments on Aggregate extraction are centered around two high-level descriptors: the core and the border. We define the Aggregates at their local level by engaging computations that describe their structure with these two descriptors. The core is roughly defined by a condensation of connectivity, involving the top Hubs and Authorities, whereas the border is defined by a decrease in the similarity between the pages and a word query representing the topical content. This query is chosen by an expert of the topic, whom we provide with a distribution of terms within the topic's resources. The computation of these descriptors can be triggered at any time during the extraction process. This methodology is somewhat close to the focused-crawling techniques described by Chakrabarti and others [3] in that we evaluate the correlation between content and link structure, but it differs in an important way: our approach is not IR-oriented, in that we do not aim at retrieving a huge number of "interesting" documents for the cheapest crawling cost. We rather aim at defining Aggregates and proving that they exist.
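Since the core is described above as a strongly connected component, one way to extract it can be sketched with a textbook algorithm (Kosaraju's two-pass method; this is a generic stand-in, not TARENTe's actual filtering code, and the adjacency-list format is our own assumption):

```python
# Illustrative sketch: largest strongly connected component of a link graph,
# via Kosaraju's algorithm. graph: dict node -> iterable of successor nodes.

def largest_scc(graph):
    nodes = set(graph) | {v for ts in graph.values() for v in ts}
    succ = {n: list(graph.get(n, ())) for n in nodes}
    pred = {n: [] for n in nodes}
    for u in nodes:
        for v in succ[u]:
            pred[v].append(u)

    # First pass: iterative DFS on the graph, recording finish order.
    seen, order = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(succ[start]))]
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(succ[v])))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Second pass: DFS on the reversed graph, in reverse finish order.
    seen, best = set(), set()
    for start in reversed(order):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in pred[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(comp) > len(best):
            best = comp
    return best
```

The returned node set would then be a candidate core, to be cross-checked against the lexical descriptor.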
2.2 Scanning phase: linkage structure and content analysis

The purpose of this phase is to gather and synthesize the trends of the statistical descriptors in a few graphical tools. To fulfill this task, we have designed two visualizations: one based on the topical content and its possible dispersion amongst the pages, and a second on the link structure. Both can be computed at the same time, or one after the other if we wish to use them to monitor an Aggregate extraction process. They provide the material for a first analysis of the crawl, and then help us build a somewhat stable structure by filtering the graph and computing prominent structures.

Topology
Fig. 1. SpectralViz of a dataset concerning the “cognitive sciences” topic
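As a rough sketch of what feeds such a plot (our own simplification: the slice count and the input format are illustrative assumptions, not TARENTe's actual code), dispatching node degrees into equal-width slices might look like:

```python
# Simplified sketch of the degree-slicing behind a SpectralViz-style plot.
# degrees: list of node degrees (in-degree + out-degree).
# Returns node counts per equal-width degree interval ("slice").

def degree_slices(degrees, n_slices=10):
    lo, hi = min(degrees), max(degrees)
    width = (hi - lo) / n_slices or 1  # guard against a degenerate range
    counts = [0] * n_slices
    for d in degrees:
        # Clamp the maximum degree into the last slice.
        i = min(int((d - lo) / width), n_slices - 1)
        counts[i] += 1
    return counts
```

A strongly power-law topic would show almost all nodes piled into the lowest slices, with a thin tail of high-degree slices, as discussed below.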
The SpectralViz permits us to characterize the topic at first sight and gives information about the crawl quality and the shape of the topological data. This shape is intended to be characteristic of a kind of topic and should make it possible to distinguish between families of topological structures. The degree of a node is the sum of its ingoing and outgoing links. The degree variation inside a topic is a first hint of its global connectivity; it is calculated by computing the difference between a node's degree and the average within the topic. We then dispatch the degrees into slices of equal size. The number of nodes displayed on the SpectralViz is the number of nodes belonging to a given slice (a given interval of degrees). These two variables permit us to detect disparities in the distribution of connectivity. We suppose that for a topic following a strong power law (of parameter k >= 2.5) we should find few nodes with high degrees and lots of nodes with weak degrees, as in Fig. 1. Large slices around the average should describe weaker power laws (k ≈ 1) and "lattice-like" topics. Naturally, these two archetypes may be mixed in a given topic. The distance shown on the figure equals the average of the shortest-path lengths in the non-oriented graph. If the distance is short, then the nodes of the same degree are close in the network (as in the largest slice in Fig. 1). A short-distance topic could be interpreted as a probably homogeneous strong-tie structure. We added normalized Hub and Authority scores to the viz, so as to locate the most interesting nodes, especially
when they are far from each other. This should give hints about the organization of the top Hubs and Authorities.

Content analysis

Fig. 2. Similarity of the documents computed against a ten-word query (Scientology, Information, Hubbard, Clearwater, Church, Islands, Scientologie, Religious, Dianetics, Rights), chosen after a content analysis of a 34,000-page crawl of the "scientology" topic, ordered by score
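A similarity score of this kind, and a crude measure of how "sharp" the fall in the sorted curve is, can be sketched as follows (the tokenization, the cosine scoring and the drop measure are our own assumptions, not necessarily TARENTe's exact computations):

```python
# Illustrative sketch: similarity of a page against a word query, plus the
# largest single-step fall in the score-sorted curve (a crude border-sharpness
# measure). Tokenization and weighting are simplifying assumptions.
from collections import Counter
from math import sqrt

def query_similarity(page_words, query_words):
    """Cosine similarity between a page's term counts and a flat query vector."""
    counts = Counter(page_words)
    dot = sum(counts[w] for w in query_words)  # query terms weighted 1 each
    page_norm = sqrt(sum(c * c for c in counts.values()))
    query_norm = sqrt(len(set(query_words)))
    return dot / (page_norm * query_norm) if page_norm else 0.0

def sharpest_drop(scores):
    """scores: similarities sorted in decreasing order.
    Returns (index, size) of the largest single-step fall in the curve."""
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return i, drops[i]
```

A large, early drop would suggest a lexically sharp border; a flat sequence of small drops, the slow topical drift discussed below.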
Again, we make use of a visualization to plot the distribution of the similarity, ordered by score. If the topics on the Web are really organized in Aggregates, and if there is in fact a frontier phenomenon between the lexicon inside and outside, we should find a decline in the curve. Moreover, the speed of the decrease gives us a hint about the border phenomenon: if the border is lexically "sharp", we should detect a fall; otherwise we should detect a slow decline. Given a topic known to the user, one should be able to infer the type of the border and thereby raise questions about the nature of the neighboring topics. For example, a slow decline could be interpreted as a slow topical drift towards lexically similar territories.

2.3 Tuning phase: stabilizing Aggregate structures

The purpose of this phase is to provide tools for interpreting hints about the topic's structure, so as to help the user decide which processes to engage in order to explore some structuring principles. The user can thus compose a process mixing topological and lexical computations to stress some interesting characteristics and patterns of the data. At the end of the tuning phase, one should be able to formulate an Aggregate typology, given a sufficiently broad variety of experiments.
Fig. 3-4. After the same string of topological and lexical computations, graphs produced from data provided by the "scientology" (left) and "cognitive sciences" (right) topics showed neither the same density nor the same structures.
Again, we make use of a viz to plot the distribution of the connectivity. As in Fig. 3-4, we can compare the results of a given processing string on different topics. Otherwise, we can compare two different processing strings on a given topic (if the assumption concerning the Aggregate structures holds strongly, we should find similarities between two or more different processes when analyzing the Aggregate structures). In a recent experimental campaign, we conceived two independent processes: the first filters pages with similar content and then calculates some topological features; the second starts from a strong topological structure and extracts content afterwards. In this case, we were able to measure an overlap (recovering) rate between the two processes.
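One plausible reading of such a recovering rate (our interpretation, not necessarily the exact measure used in the campaign) is the Jaccard overlap of the two retrieved page sets:

```python
# Sketch of an overlap ("recovering") rate between two extraction processes,
# taken here as the Jaccard overlap of the retrieved page sets. This is our
# own reading of the measure, for illustration.

def overlap_rate(pages_a, pages_b):
    a, b = set(pages_a), set(pages_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

A rate close to 1 would mean the content-first and topology-first processes converge on the same Aggregate, supporting the correlation hypothesis.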
3. Exploring and Visualizing Aggregates Structures The aim of this phase is to visually detect significant patterns and link structures. These patterns can be of multiple types, and this task is obviously heuristic. The final goal may be the generation of a grammar or typology of the Aggregates.
3.1 Graphs and visual patterns
At every step, we try to choose the data representation carefully. For example, we mix traditional tables (like the one below) with other powerful graphical representations. First, tables give a first insight into the dataset. Then we compare graph representations computed using the same algorithms for a readable spatial layout.

Fig. 5. Detection of competing communities in a graph (after a HITS-like distillation)
Visual detection of interesting patterns and structures leads to the description of similarities and dissimilarities between the Authority graphs of different topics. Hence we can interpret a natural clustering composed of zones of equal connectivity density, or of similar resource volume, and infer topological organization models (some are more hierarchic, others more lattice-like). Moreover, this phase enables analysis at a very local scale that numerical or spectral visualizations cannot provide. For example, we can detect competing communities (like the Church of Scientology and the anti-sect associations) and typical objects like webrings or hierarchical trees.
3.2 Immersing navigation tool
Fig. 7. Cartographic Viz: the gravitational model of the Web can provide a representation in terms of topic galaxies, therefore giving a powerful insight into the lexical and topological core of an Aggregate.
The cartographic model associates interactive functionalities and visual animations which facilitate the cognitive appropriation of the given Aggregate. The topological and semantic information is laid out along three dimensions of such maps, described in section 4.

Fig. 6. Contextual navigation tool: a mid-scale graph (background), a locality graph (foreground), a browser (right) and a keyword table (left)
The navigation is oriented by a mid-scale subgraph of the topic, which displays contextual linkage information in the neighborhood of a given node. The immersing dimension comes from a windowed interface involving graphs at different local scales (displayed on demand) and basic statistics of the nodes (Hub and Authority scores, in- and out-degrees). Important contextual insight can be provided by means of proximity measures with the core, both topological (path lengths) and lexical (a similarity measure computed against the query representing the topic). The latter is represented as a subgraph that the user can explore by selecting nodes and retrieving a measure of dispersion of the lexical data.
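The topological proximity measure can be sketched as a multi-source breadth-first search from the core over the undirected link graph (an illustrative stand-in; the graph format and the choice of hop counts as the distance are our assumptions):

```python
# Sketch of the topological proximity measure: shortest-path length (in hops)
# from each node to the nearest core node, via multi-source BFS on the
# undirected version of the link graph. Illustrative stand-in only.
from collections import deque

def distance_to_core(graph, core):
    """graph: dict node -> iterable of linked nodes; core: set of core nodes."""
    # Build an undirected adjacency map from the directed links.
    adj = {}
    for u, targets in graph.items():
        for v in targets:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # BFS from every core node at once: each node gets its hop count to the core.
    dist = {n: 0 for n in core}
    queue = deque(core)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Nodes absent from the returned mapping are unreachable from the core, which is itself a useful hint when positioning a node with respect to the Aggregate.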
4. Solutions for the end-user interface

TARENTe generates XML description files used with an ad-hoc Flash interface to provide synoptic navigation in a prepared Aggregate. The preparation phase is done by an expert of the topic. Two types of solutions are available: a) a cartographic solution based on the display of the most important sites, described by some of their topological and semantic properties (like sub-topics), as in Fig. 7; b) a traditional display of complementary resources based on trees, lists and keyword-based queries. The cartographic maps are organized along three dimensions:
– From center to periphery: in the center are located the most representative nodes, whereas in the periphery one can find topologically less important nodes. These nodes are also less lexically centered, but give gateways to neighboring topics.
– The display is divided into different zones describing semantically homogeneous sets. On Fig. 7 one can find four subsystems of the cognitive sciences topic: AI & mathematics, human and social sciences, institutions, and neurobiology.
– Borders are materialized by outlining neighboring topics in the periphery (on Fig. 7, the English-speaking domain and the publishers).
5. Further work

We would like to continue this work in two directions:
– giving the expert an entirely graphical monitoring process, especially by integrating dynamic graphs during the crawl;
– addressing the temporal dimension of Aggregates, which is an important goal for us: we plan to integrate a "player-like" interface to understand the birth, life and death of these objects.
6. References

[1] R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks", Rev. Mod. Phys. 74, 2002.
[2] J. Bertin, Sémiologie graphique, Gauthier-Villars Mouton, Paris - La Haye, 1967.
[3] S. Chakrabarti, M. Van den Berg, B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", March 1999, available on
[4] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, "Experiments in Topic Distillation", ACM SIGIR'98 Post-Conference Workshop on Hypertext IR for the Web.
[5] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, San Francisco, 2003.
[6] B. D. Davison, "Topical Locality in the Web: Experiments and Observations", Technical Report DCS-TR-414, Department of Computer Science, Rutgers University, 2000.
[7] M. Dodge, R. Kitchin, Mapping Cyberspace, Routledge, London, 2001.
[8] F. Ghitalla, C. Maussang, E. Diemert, F. Pfaender, "TARENTe: an Experimental Tool for Extracting and Exploring Web Aggregates", in Proceedings of IEEE ICTTA'04, Damascus, Syria, April 2004.
[9] F. Ghitalla (ed.), La Navigation, Les Cahiers du Numérique, Hermès Editions, Paris, 2003.
[10] J. Kleinberg, S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", in Proc. of WWW7, 1998.
[11] J. Kleinberg, D. Gibson, P. Raghavan, "Inferring Web Communities from Link Topology", in Proc. of the 9th ACM Conference on Hypertext and Hypermedia (HYPERTEXT'98), pages 225-234, New York, 1998.
[12] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", in Proc. of the ACM-SIAM Symposium on Discrete Algorithms, ACM Press, 1998.
[13] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, "Trawling the Web for Emerging Cyber-Communities", in Proc. of the 8th International World Wide Web Conference, May 1999.
[14] F. Menczer, G. Pant, P. Srinivasan, M. E. Ruiz, "Evaluating Topic-Driven Web Crawlers", SIGIR'01, September 9-12, New Orleans, 2001.
[15] S. Mukherjea, "WTMS: A System for Collecting and Analyzing Topic-Specific Web Information", Proceedings of the 9th International World Wide Web Conference, Amsterdam, Netherlands, May 15-19, 2000.
[16] T. Munzner, Ph.D. Thesis, Stanford University, 2001.
[17] B. Shneiderman, S. K. Card, J. D. Mackinlay, Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, 1999.
[18] E. R. Tufte, The Visual Display of Quantitative Information, Graphics Press, 1993.