
BioQSpace: An interactive visualisation tool for clustering MEDLINE abstracts

Anna Divoli 1,2, Rasmus Winter 2, Steve Pettifer 2 & Terri Attwood 1,2

1 Faculty of Life Sciences & 2 School of Computer Science, The University of Manchester, UK

OVERVIEW

BioQSpace combines a set of algorithms and an architecture for interactively clustering and browsing MEDLINE abstracts in a 3D virtual environment. The clustering is based on document similarity measures calculated from user-specified weightings of attributes such as MeSH terms, word usage and specialised word lists. Several selection, navigation and reconfiguration options are provided.

THE INTERFACE

BioQSpace is an environment presented as a window in a desktop graphical user interface (GUI). Users can query abstracts from PubMed using an embedded search facility (Fig.1). They can append many queries together and create their own list of terms (which accepts regular expressions) to be used later as high-weight terms for clustering (Fig.2), before they launch the main visualisation GUI (Fig.3). The main GUI provides several options for selection, deletion, navigation and interaction, in terms of both clustering and visualisation (Fig.4).

CLUSTERING

Weight sliders

BioQSpace performs pairwise similarity calculations between all the abstracts, based on a set of individual attributes (collected from PubMed and BioIE[1]). Each attribute is given more or less importance according to a user-manipulated slider that specifies the weight attached to it.

Fig.5: The user-adjustable weight sliders.
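As a rough illustration of how slider weights might combine per-attribute scores, the sketch below compares two abstracts attribute by attribute and takes a weighted average. The Jaccard overlap, the attribute names (`mesh`, `words`) and both function names are assumptions made for this example; the poster does not state which per-attribute measures BioQSpace actually uses.

```python
def jaccard(a, b):
    """Overlap between two term collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(doc_a, doc_b, weights):
    """Weighted average of per-attribute Jaccard scores, in [0, 1].

    doc_a, doc_b: dicts mapping attribute name -> iterable of terms.
    weights: dict mapping attribute name -> slider weight (>= 0).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(
        w * jaccard(doc_a.get(attr, ()), doc_b.get(attr, ()))
        for attr, w in weights.items()
    )
    return score / total  # dividing by the total weight keeps the result in [0, 1]


# Hypothetical attribute values for two abstracts.
a = {"mesh": {"Insulin", "Diabetes Mellitus"},
     "words": {"glucose", "insulin", "trial"}}
b = {"mesh": {"Insulin", "Obesity"},
     "words": {"glucose", "insulin", "diet"}}

print(weighted_similarity(a, b, {"mesh": 1.0, "words": 0.0}))  # MeSH terms only
print(weighted_similarity(a, b, {"mesh": 0.0, "words": 1.0}))  # word usage only
print(weighted_similarity(a, b, {"mesh": 1.0, "words": 1.0}))  # equal weights
```

Moving a slider to zero removes that attribute from the score entirely, which is one way the same abstracts could cluster differently under different settings.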

Fig.6: The advanced options for changing the scoring thresholds of some weight lists.

Fig.1: The starting GUI: users can load & save their PubMed queries.

Fig.3: The visualisation GUI as launched (before any user selection or customisation).

Fig.2: The GUI for the user-defined weighting list.

Score thresholds

To reduce processing time, score thresholds are used to create subsets of some weighting lists. The thresholds are chosen automatically when a data set is loaded, so that each subset consists of approximately 25% of the complete list. The thresholds can be changed using the advanced options.
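One plausible way to pick such a threshold automatically is to rank the scores and cut at the entry that leaves about a quarter of the list. The sketch below assumes this ranking approach; the function names and the example scores are hypothetical and are not taken from BioQSpace.

```python
import math


def top_fraction_threshold(scores, fraction=0.25):
    """Return the score cut-off that keeps roughly `fraction` of the entries."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[keep - 1]


def apply_threshold(term_scores, threshold):
    """Keep only terms scoring at or above the threshold.

    Tied scores at the cut-off are all kept, so the subset can be
    slightly larger than the requested fraction.
    """
    return {term: s for term, s in term_scores.items() if s >= threshold}


# Hypothetical weighting list with eight scored terms.
term_scores = {"insulin": 0.9, "glucose": 0.8, "patient": 0.3, "study": 0.2,
               "result": 0.15, "method": 0.1, "data": 0.05, "use": 0.01}

threshold = top_fraction_threshold(list(term_scores.values()))
subset = apply_threshold(term_scores, threshold)
print(threshold, sorted(subset))  # 0.8 ['glucose', 'insulin']
```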

Similarity measures & Minimal Spanning Tree

The weighted measures are summed and scaled to produce a value between 0 and 1, giving a final similarity measure between two abstracts. A similarity matrix for all abstracts is then produced, followed by an ordered list, which in turn is used to generate a Minimal Spanning Tree (MST) using Kruskal's algorithm.
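The step from similarity matrix to tree can be sketched as follows. This is a minimal illustration of Kruskal's algorithm with a union-find structure, not the BioQSpace code. It assumes that the ordered list ranks abstract pairs by decreasing similarity, which is equivalent to building a minimum spanning tree on the distance 1 - similarity; the function name and the example matrix are hypothetical.

```python
def kruskal_spanning_tree(sim):
    """Build a spanning tree linking the most similar abstracts.

    sim: symmetric N x N matrix (list of lists) of similarities in [0, 1].
    Returns a list of (i, j, similarity) edges.
    """
    n = len(sim)
    # Ordered list of all pairs, most similar first (assumed ordering).
    edges = sorted(
        ((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Find the set representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for s, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate components: keep it
            parent[ri] = rj
            tree.append((i, j, s))
            if len(tree) == n - 1:
                break
    return tree


# Hypothetical similarities for four abstracts.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(kruskal_spanning_tree(sim))  # [(0, 1, 0.9), (2, 3, 0.8), (1, 2, 0.3)]
```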

Fig.4: BioQSpace screenshot: the user has adjusted the weight sliders and has selected an abstract to view while navigating in the 3D space.

VISUALISATION

Grouping & 3D configuration

The tree is then traversed, 'colouring' each of the nodes (representing the MEDLINE abstracts) to indicate which group that node belongs to. A node is assigned to a different group from its parent in the MST if its similarity to that parent is less than a given threshold. Once the groups of similar abstracts have been identified, a force placement algorithm is used to generate a 3D configuration of the nodes and their links. The algorithm calculates repelling forces for all participating nodes, and attractive forces only for those nodes connected by edges. It considers the dominant nodes of each group first, and then the remaining nodes within each group. Simultaneously, an algorithm determines and renders the Minimal Convex Hulls that contain the groups/clusters.
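The two steps described above, colouring by threshold and force placement, can be sketched as below. This is a simplified illustration under stated assumptions, not the BioQSpace implementation: the function names, the force constants and the choice of root node are hypothetical, the dominant-node ordering is omitted, and the all-pairs repulsion loop here is quadratic rather than the N log N behaviour reported for the real algorithm.

```python
import math
from collections import defaultdict, deque


def colour_groups(n, tree_edges, threshold, root=0):
    """Assign a group id to each node by walking the tree from `root`.

    A child keeps its parent's group unless its similarity to the parent
    is below `threshold`, in which case it starts a new group.
    """
    adj = defaultdict(list)
    for i, j, s in tree_edges:
        adj[i].append((j, s))
        adj[j].append((i, s))
    group = [None] * n
    group[root] = 0
    next_group = 1
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child, s in adj[node]:
            if group[child] is not None:
                continue  # already visited (this is the parent)
            if s < threshold:
                group[child] = next_group
                next_group += 1
            else:
                group[child] = group[node]
            queue.append(child)
    return group


def force_step(pos, edges, repulsion=0.01, attraction=0.1, step=0.1):
    """One iteration: repel every pair of nodes, attract only linked nodes."""
    n = len(pos)
    force = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [pos[i][k] - pos[j][k] for k in range(3)]
            dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
            push = repulsion / (dist * dist)  # inverse-square repulsion
            for k in range(3):
                f = push * delta[k] / dist
                force[i][k] += f
                force[j][k] -= f
    for i, j, _ in edges:
        for k in range(3):
            f = attraction * (pos[j][k] - pos[i][k])  # spring-like pull
            force[i][k] += f
            force[j][k] -= f
    return [[p[k] + step * f[k] for k in range(3)] for p, f in zip(pos, force)]


# Hypothetical spanning tree over four abstracts: (node, node, similarity).
tree = [(0, 1, 0.9), (2, 3, 0.8), (1, 2, 0.3)]
print(colour_groups(4, tree, threshold=0.5))  # [0, 0, 1, 1]

pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for _ in range(50):
    pos = force_step(pos, tree)
```

In this sketch the low-similarity edge between nodes 1 and 2 still pulls the two groups together during layout; whether BioQSpace weights the attraction by similarity is not stated in the poster.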

Colouring & Hulls

Abstracts that are very similar, and are therefore grouped together, are represented by nodes of the same colour and enclosed in a semi-transparent hull. The hulls are reshaped dynamically as the nodes move in space.

APPLICATIONS

Fig.9: Abstracts/nodes can be: (a) viewed normally, (b) selected, or (c) marked.

Fig.7: 3D article navigation. Information on cluster formation is available on mouseover when using tooltips.

3D navigation, abstract selection or deletion, & viewing further information

Users can control navigation through the abstracts with the mouse or with buttons on the GUI (Fig.8); highlight abstracts containing certain keywords, terms or phrases (Fig.11); or delete irrelevant abstracts (Fig.8). They can also select abstracts to view their details (Fig.4): PubMed-derived information, and BioQSpace-generated information such as the top keywords of the title and abstract (given as stems produced by the Porter algorithm) with their scores, based on normalised TF-IDF calculations. Help menus and further information, such as the possible meanings of a stem (Fig.12), are also available.
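To show how keyword scores of this kind can be derived, the sketch below ranks stems per abstract by TF-IDF. It assumes the tokens have already been stemmed (the Porter step is not reproduced here) and that "normalised" means scaling each abstract's score vector to unit length; both are assumptions, as are the function name and the example data.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring (stem, score) pairs per document.

    docs: list of token lists, assumed to be already stemmed.
    Scores are tf * idf, L2-normalised per document (an assumption).
    """
    n = len(docs)
    # Document frequency: in how many documents each stem occurs.
    df = Counter(stem for doc in docs for stem in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        raw = {s: (c / len(doc)) * math.log(n / df[s]) for s, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores = {s: v / norm for s, v in raw.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append(ranked[:k])
    return result


# Hypothetical stemmed abstracts.
docs = [
    ["insulin", "receptor", "signal", "insulin"],
    ["insulin", "diet", "obes"],
    ["receptor", "kinas", "signal"],
]
for keywords in top_keywords(docs, k=2):
    print(keywords)
```

Stems that occur in every abstract score zero under this weighting, so the top keywords tend to be the terms that distinguish one abstract from the rest of the loaded set.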

Fig.12: Help menus & word stem meanings are available.

Fig.8: Navigation & visualisation options

BioQSpace presents the articles as points in 3D space and allows the user to explore that space, both through 3D navigation and through the raw comparison data from the articles. The user can also dynamically tune the comparison algorithm to place emphasis on particular attributes.

Literature retrieval & 3D navigation

BioQSpace may be used by biologists and bioinformaticians to navigate the biomedical literature, which is full of complex concepts, according to their own requirements. The clustering algorithm and the user-adjustable weights provide a number of alternative ways to select related articles, instead of searching linearly through a somewhat arbitrarily ordered list. The 3D virtual environment allows easy navigation while exploring associations between various biomedical concepts, entities, diseases, drugs and so on.

Fig.10: None, several or all abstracts can be selected at once.

Tooltips

Tooltips (Fig.13) allow the user to view the highest-scoring keywords when selecting an abstract/node (Fig.4), or some cluster attributes when selecting a cluster/hull (Fig.7).

Fig.13: The use of labels and tooltips is controlled by the user.

Fig.14: The same set of abstracts can cluster in a very different way if different attributes are considered. Here, 300 diabetes-related abstracts form clusters based exclusively on (a) disease & drug terms, (b) MeSH terms.

Network formations

The many combinations of options offered by the application can produce domain-specialised cluster formations, within which users can build their own literature network from the (optional) trail lines created during their navigation (Fig.15).

Fig.11: Users can mark articles by attribute.

PERFORMANCE

BioQSpace was tested under Linux Fedora Core 2, running on a computer with an AMD Athlon XP 2200+ processor, 512MB RAM and an NVIDIA GeForce4 MX 440 with AGP8x. The average load time was around 8 seconds for 50 abstracts and around 210 seconds for 500 abstracts (the article comparison algorithm is of order N^2 and the force placement algorithm of order N log N).
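For a sense of scale, the short calculation below counts the pairwise comparisons implied by the N^2 term for the two reported data-set sizes, and sets their ratio against the ratio of the reported load times. The function name is hypothetical, and the comparison is only indicative, since the poster does not break the timings down by stage.

```python
def pair_count(n):
    """Number of distinct abstract pairs among n abstracts."""
    return n * (n - 1) // 2


small, large = 50, 500
print(pair_count(small))  # 1225
print(pair_count(large))  # 124750

pair_ratio = pair_count(large) / pair_count(small)
time_ratio = 210 / 8  # reported average load times, in seconds
print(round(pair_ratio, 1), round(time_ratio, 1))  # 101.8 26.2
```

The reported load time therefore grows by a smaller factor than the number of pairs between these two sizes.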

ENVIRONMENT & REQUIREMENTS

BioQSpace has been built on an existing application named Q-SPACE[2], a conceptual environment consisting of objects positioned within 3D space. It combines several scripts and requires modules and libraries of Perl, Perl Tk, C, C++, Qt 3.3 and MAVERIK[3] 6.2. We are aiming to create a package for this application that will be freely downloadable within the next few months.

Comparison studies

Cluster formations can be compared between, for instance, semantic clustering (using only MeSH-term similarity), syntactic clustering (using word usage) and specialised-interest clustering (using the predefined word lists, the user-defined word lists and their combinations) (Fig.14). Clustering based on the publication date can provide hints on the evolution of biological knowledge.

Fig.15: The use of the trail allows users to create a “path” while navigating through the abstracts.

REFERENCES

[1] Divoli, A. and Attwood, T.K. (2005) BioIE: extracting informative sentences from the biomedical literature. Bioinformatics 21: 2138-2139.

[2] Pettifer, S., Cook, J. and Mariani, J. (2001) Towards Real-Time Interactive Visualisation in Virtual Environments: A Case Study of QSPACE. Proceedings International Conference on Virtual Reality 2001, Laval, France, pp.121-129, May.

[3] Hubbold, R., Cook, J., Keates, M., Gibson, S., Howard, T., Murta, A., West, A. and Pettifer, S. GNU/MAVERIK: A micro-kernel for large-scale virtual environments. Presence, Teleoperators and Virtual Environments, pp.22-34, ISSN 1054-7460, Vol.10(1), February, MIT Press.

CONTACT: [email protected]