
Visual corpus interface – putting text visualizations at use

Verena Lyding†, Michel Généreux*
† Institute for Specialised Communication and Multilingualism, European Academy of Bozen/Bolzano, Italy
* Department of Linguistics and Language, McMaster University, Hamilton, ON, Canada

Abstract
This paper presents the visual corpus interface created within the OPATCH 1 project (‘Open Platform for access to and Analysis of Textual documents of Cultural Heritage’). The interface combines a set of visualizations for textual data and language features with the aim to support linguistic analysis. The paper describes the choice of visualizations, their interlinking and their integration with a corpus search engine. It discusses challenges related to data handling and usage scenarios and lays out open needs for future work.

Keywords--- visual interfaces, visual analysis, language visualization, interlinking of visualizations, language corpora, digital humanities.

1. Introduction
In linguistic research as well as in the Digital Humanities, collections of digital text documents, that is, language corpora, have become a research tool of high relevance and prevalence. Language corpora serve as a basis for studying linguistic phenomena in authentic texts and for retrieving information on specialized subjects such as history. Access to and analysis of corpora is facilitated by search and analysis tools, which provide for the automated retrieval of data and the presentation of retrieved results. Historically, the presentation format of corpus search results has mainly been limited to the display of plain text, partly aligned on the search term (i.e. KeyWord In Context displays, KWIC, see Figure 1). More recently, different visual display formats for textual data and language features have been proposed (see section 3 below). While related visualizations have reached technical maturity as prototypical applications, they are still mainly delivered as isolated tools and are rarely integrated with workable corpus resources or other language visualization tools. Accordingly, search interfaces that make comprehensive use of visualizations to support analysis are still rather sparse or provide fragmented solutions.

1 http://commul.eurac.edu/opatch/

Figure 1 KWIC display for "Geschichte" 2

Within the OPATCH project, an online corpus interface combining a variety of available language visualizations has been developed to serve the linguistic and content analysis of historical and contemporary texts of South Tyrolean German [1].

2. The OPATCH project
The work on a visual corpus interface presented in this paper is carried out within the context of the OPATCH 1 project (‘Open Platform for access to and Analysis of Textual documents of Cultural Heritage’), an interdisciplinary project run as a joint initiative between a public library and language technology research institutions. 3

2.1. Project aims
The OPATCH project aims at creating a multipurpose corpus platform which provides a comprehensive infrastructure for the processing and delivery of digital text documents of South Tyrolean cultural heritage. It offers two front-ends for different usage scenarios and target groups [1]. On the one hand, a content-search portal 4 gives access to digitized historical newspaper data and allows for detailed filtering by metadata. On the other hand, a linguistic search portal 5 supports the linguistic analysis of texts by integrating a set of known visualizations for language data. In the following, this paper focuses on presenting and discussing visualization and design choices related to the linguistic search portal.

2 Taken from DWDS corpus, Digital Dictionary of German language (‚Das Digitale Wörterbuch der deutschen Sprache‘) of the Berlin-Brandenburg Academy of Sciences and Humanities, http://dwds.de
3 The partnership is composed of the Institute for Specialized Communication and Multilingualism of the European Academy of Bozen/Bolzano, the Dr. Friedrich Teßmann library (Bolzano) and, as associated partner, the Institute for Corpus Linguistics and Text Technology of the Austrian Academy of Sciences (ICLTT, Vienna).
4 http://digital.tessmann.it/tessmannDigital/digitisedJournalsArchive/journals
5 http://commul.eurac.edu/opatch/LinguisticPortal/

2.2. Corpus data
The linguistic corpus portal builds on data from three sources: data from digitized historical newspapers (section 2.2.1.), data from a contemporary local newspaper (section 2.2.2.), and a balanced selection of South Tyrolean texts of different text types (section 2.2.3.). All data sources are in the local variety of German spoken in South Tyrol / Tyrol.

2.2.1. Historical newspaper data. The newspaper corpus contains 100,000 pages of German newspapers from (South) Tyrol for the years 1910 to 1920. They are part of the historical newspaper archive from the Alpine region held at the Dr. Friedrich Teßmann library and were selected with regard to maximizing OCR (Optical Character Recognition) quality and to including full (not partial) news issues.

2.2.2. Contemporary newspaper data. The corpus of contemporary German news of the Alto Adige region consists of more than 17,000 issues of the South Tyrolean newspaper “Dolomiten” (publisher Athesia), with an overall total of more than 66 million tokens of text. The corpus covers data from the years 1991, 1996, 2001, 2005 and 2006.

2.2.3. Balanced set of South Tyrolean texts. The third text collection is the core part of the ‘Korpus Südtirol’ initiative [2] and comprises a balanced selection of texts from four genres: fiction, informative, functional (e.g. user manuals) and journalistic texts. It has a size of 3.5 million tokens and spans the entire 20th century.

2.3. Purpose of the visual search interface
Within the OPATCH project, the visually enhanced search interface is created in order to facilitate the linguistic analysis of corpus texts. Corpus-driven linguistic research includes phases of data exploration and in-depth inspection of data. Both tasks can be supported by visualizations which provide a cognitive aid by externalizing information into graphics (cf. [3]). Furthermore, "good visualizations use graphics to organize information, highlight important information, allow for visual comparisons, and reveal patterns, trends, and outliers in the data" [4]. Following the information-seeking mantra “Overview first, zoom and filter, then details-on-demand.” [5], the visual corpus interface is designed with the aim to provide in parallel different views on the data on different levels of abstraction. Furthermore, the interface design aims at displaying different aspects of the data in parallel, which means focusing on selected data relations or structural aspects of the text by targeting them through adapted visualizations. Overall, the visual corpus interface thus aims at providing a diversified and multifunctional tool for the analysis of a comprehensive collection of South Tyrolean German texts.

3. Related work
In recent years, interest in language-related visualizations has greatly increased 6, as has their use for research in linguistics (cf. e.g. [6]) and the digital humanities (cf. e.g. [7]). The following two subsections give a short overview of related work on specialized visualizations for textual data and derived features (section 3.1.) and on integrated visualization systems for the analysis of corpora and text collections (section 3.2.).

3.1. Visualizations for text and derived features
Visualizations for textual data either build on the text itself, as a sequence of running words, or relate to derived linguistic features and their frequencies, such as lexical, semantic or phonological categories. Word Tree [8], WordGraph [9], Arc Diagrams [10] and Concgrams [11] address the challenge of visualizing words in context by combining tree, graph, arc or table structures with visual highlighting and interactive features. Furthermore, language visualizations focusing on selected linguistic features have been proposed. For example, [12] use various chart types to visualize lexical phenomena and [6] introduce a matrix visualization for studying phonological characteristics of languages, namely vowel harmony. [13] introduce Tag Clouds for comparing frequencies of keywords across texts. Various visualizations specifically address the distribution of features over text positions. With TileBars, [14] presented an early visualization for the display of occurrences of search terms over a text. [15] propose a similar visualization to analyze the occurrence of different characters in theater plays, and [16] present a specialized tool for the analysis of discourse structures. Finally, [17] discuss approaches and challenges related to the visual analysis of textual data in general terms.

3.2. Integrated visualization systems
In addition to specialized visualizations for selected aspects of textual data, a wide variety of systems that integrate various visualizations into one application are available. These systems serve the analysis of text collections for different purposes. On the one hand, they include systems for text mining and content analysis, such as MiTextExplorer [18], WordSeer [19] or Jigsaw [20]. On the other hand, we find visualization tools for linguistic and lexicographic research, such as CorpusClouds [21], or the visualizations envisioned for SketchEngine [22].

6 As can be observed from the recurrence of specialized workshops on the topic, such as EACL 2012 workshop of LINGVIS & UNCLH, Advances in Visual Methods for Linguistics 2012 & 2014, etc.

4. The visual corpus interface
The linguistic portal of the OPATCH project implements a visual corpus interface by combining several existing language visualization tools into one comprehensive interface. Each visualization brings out a particular aspect of the data and provides a targeted view onto a specific level of abstraction. The interface displays the different visualizations in parallel.

4.1. Overall structure of the interface
The interface is subdivided into seven display panels (see Figure 2). The topmost area displays the menu of multifaceted search options (for details see Figure 3). The second panel gives instructions on the use of the interface and information about the visualizations. The remaining five panels each host an individual visualization component, as described in detail below. The overall interface structure builds on the web framework from the ‘Bootstrap’ library. 7

Figure 2 Macro-structure of the visual interface

Figure 3 Search options of the visual interface

7 http://getbootstrap.com/

4.2. The visual interface components
The individual visual interface components accommodate specialized language visualizations for corpus search results. The following subsections briefly present each visualization and describe its intended use for linguistic analysis purposes.

4.2.1. KWICis is a visualization for presenting the results of corpus searches. By displaying, for each result, the search term and its surrounding words to the left and right, it is a ‘modern concordance (keyword in context = KWIC) visualization that is interactive and designed for structured data’ 8. The KWICis visualization relies on a table structure and displays the search word and the three words to its left and right in separate columns according to their position. An extra column shows the entire text segment. Results can be ordered by each column position, and an automatic clustering of results is indicated by the coloring of the hit line. The clustering is calculated based on structural attributes of the corpus data, such as context (left and right) and index (position of the word in the text). KWICis performs k-means clustering and offers a selection of similarity measures (matching, dice, jaccard, overlap coefficient and cosine).

For the use in linguistic analyses, the interactive and structure-based KWICis visualization in particular supports the close inspection of search results. The textual context of search hits is shown in full detail, while the user is able to reorder results according to left and right word positions. The visual mark-up according to the automatic clustering provides a further indication of possibly meaningful groupings into coherent subsets. Figure 4 presents KWICis results for a search for “Geschichte” (English: “history”). Results are ordered by the position one word to the right of the search hit. Purple, green and white highlighting of the search hit distinguishes the result clusters. The column displaying the entire text segment has been excluded from the screenshot.

Figure 4 KWICis visualization for "Geschichte"
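The similarity measures that KWICis offers for clustering (matching, dice, jaccard, overlap coefficient and cosine) have common set-based definitions. The following Python sketch illustrates those textbook definitions on two context word lists. It is only an illustration: how KWICis itself represents contexts and computes the measures is not described here, and the function name `similarities` and the example contexts are hypothetical.

```python
from math import sqrt


def similarities(a, b):
    """Set-based similarity coefficients between two context word lists.

    Assumption: each context is treated as a set of word forms;
    KWICis may use a different representation.
    """
    x, y = set(a), set(b)
    if not x or not y:
        raise ValueError("both contexts must contain at least one word")
    shared = len(x & y)
    return {
        "matching": shared,                        # size of the intersection
        "dice": 2 * shared / (len(x) + len(y)),
        "jaccard": shared / len(x | y),
        "overlap": shared / min(len(x), len(y)),
        "cosine": shared / sqrt(len(x) * len(y)),
    }


# Two invented hit lines for the search word "Geschichte".
first = ["die", "Geschichte", "der", "Stadt"]
second = ["die", "Geschichte", "des", "Landes"]
scores = similarities(first, second)
```

Any of these coefficients could then serve as the similarity function of a clustering step; which one groups hit lines most usefully depends on the data.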

8 see http://linguistics.chrisculy.net/lx/software/KWICis

4.2.2. DoubleTree is a compact visualization for concordance data [23], which collapses the contexts around a search term into a double-sided tree. Different contexts are represented by separate branches which are labeled with the words they represent. Branching points can occur at every word position, and the frequencies of parallel contexts are indicated by the font size of their words. The DoubleTree visualization is interactive. Branches can be expanded on click, which results in a coloring of all valid branches on the opposite side of the tree. Furthermore, detailed information on the exact context frequencies as well as additional information such as lemma and word class is displayed in a pop-up window on mouseover.
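One side of such a tree can be pictured as a prefix tree in which every node counts how many contexts pass through it. The Python sketch below is a minimal illustration under that assumption; the internal data structure of DoubleTree is not described in this paper, and `build_context_tree` and the sample contexts are hypothetical. The left side would be built the same way from the left contexts read outward from the search word.

```python
from collections import defaultdict


def build_context_tree(contexts):
    """Collapse token sequences into a nested prefix tree with counts."""
    def new_node():
        return {"count": 0, "children": defaultdict(new_node)}

    root = new_node()
    for tokens in contexts:
        root["count"] += 1
        node = root
        for token in tokens:
            # Descend into (or create) the branch for this word.
            node = node["children"][token]
            node["count"] += 1
    return root


# Invented right contexts of the search word "Geschichte".
right_contexts = [
    ["der", "Stadt"],
    ["der", "Stadt"],
    ["der", "Kirche"],
    ["des", "Landes"],
]
tree = build_context_tree(right_contexts)
der_branch = tree["children"]["der"]
```

The `count` stored at each node corresponds to the frequency that a display could map to font size.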

Figure 5 DoubleTreeJS visualizing "Geschichte"

For the use in linguistic analyses, DoubleTree in particular serves to gain an overview and general understanding of the data, as large result sets are condensed into a compact tree representation. Furthermore, its interactive features allow for dynamic browsing of the results with respect to the context on both sides of the word, and thus support data exploration. Figure 5 shows a DoubleTree for the search word “Geschichte” (English: “history”), with its right context expanded by two word positions, and valid continuations of the left context indicated by red fonts. Within the OPATCH interface the JavaScript version of the visualization, DoubleTreeJS 9, is employed.

9 http://linguistics.chrisculy.net/lx/software/DoubleTreeJS/index.html

4.2.3. Slash/A is a timeline visualization for comparing occurrence frequencies of (sequences of) words, lemmas or parts-of-speech within different corpora [24]. Occurrence values are plotted as a line graph, with time information represented on the x-axis and frequency information on the y-axis. The values for the different corpora are distinguished by color. Slash/A integrates statistical measures (MI, log likelihood) for calculating significant differences in the occurrences between corpora.

Figure 6 Slash/A visualization for "Geschichte"

For the use in linguistic analyses, Slash/A in particular serves the analysis of language evolution over time as well as the comparison of terms between different collections of texts. Figure 6 shows a Slash/A visualization for the occurrence of “Geschichte” (English: “history”) within five (South) Tyrolean newspapers, years 1910 to 1920.

4.2.4. Collocation graph is a basic network visualization for the display of left and right collocates of a given search term. The visualization displays the five most frequent collocates within a user-specified context window of one to three words. Collocates are ranked by their strength-of-association score. They are displayed as a graph structure, with the score explicitly given. Users can select among three scores: raw frequency, mutual information and log odds ratio with an informative Dirichlet prior.

Figure 7 Collocation graph for immediate right collocates of "Geschichte"

For the use in linguistic analyses, the collocation graphs are particularly helpful to get a quick overview of the most frequent left and right collocates of a search word, without having to deal with detailed context information (as provided in KWICis and DoubleTree). Figure 7 shows a collocation graph for collocates in the immediate right context (distance of 0 words) of “Geschichte” (English: “history”). The collocation graphs are based on the JavaScript visualization library D3.js. 10

10 https://d3js.org/

4.2.5. Co-occurrence matrix is a basic cross-tabulation visualization for the frequencies of two words occurring as neighbors of each other (to the left or right, at a specified distance). The word and its collocates are listed on the horizontal and vertical dimensions of the matrix, and the coloring of the matching rectangular fields indicates the co-occurrence frequency of the respective word combinations, with stronger color opacity corresponding to higher frequency.

Figure 8 Co-occurrence matrix for left collocates of "Geschichte"

For the use in linguistic analyses, co-occurrence matrices are particularly suited to exploring the relations among the collocates of a search word, in addition to exploring collocation frequencies in relation to the search word itself. Figure 8 shows a co-occurrence matrix for the left-side collocates of “Geschichte” (English: “history”), at word distance 1. The co-occurrence matrix visualization is created with the JavaScript visualization library D3.js, building on the ‘Les Misérables Co-occurrence’ example by Mike Bostock. 11
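The counts behind a collocation graph and a co-occurrence matrix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OPATCH implementation: the function names and the toy sentences are hypothetical, `offset=1` means directly adjacent tokens, and pointwise mutual information is used as one common reading of "mutual information", since the exact formula is not given in the paper.

```python
from collections import Counter
from math import log2


def neighbor_counts(sentences, offset=1):
    """Count ordered pairs (w1, w2) where w2 stands `offset` tokens right of w1."""
    pairs, words = Counter(), Counter()
    for sentence in sentences:
        words.update(sentence)
        for i in range(len(sentence) - offset):
            pairs[(sentence[i], sentence[i + offset])] += 1
    return pairs, words


def pmi(pairs, words, w1, w2):
    """Pointwise mutual information (in bits) of the ordered pair (w1, w2)."""
    joint = pairs[(w1, w2)] / sum(pairs.values())
    if joint == 0:
        return float("-inf")
    total = sum(words.values())
    return log2(joint / ((words[w1] / total) * (words[w2] / total)))


def cooccurrence_matrix(pairs, vocab):
    """Square matrix of pair counts: rows are left words, columns right words."""
    return [[pairs[(row, col)] for col in vocab] for row in vocab]


# Invented mini-corpus around the search word "Geschichte".
sentences = [
    ["die", "Geschichte", "der", "Stadt"],
    ["die", "Geschichte", "des", "Landes"],
    ["eine", "Geschichte", "der", "Stadt"],
]
pairs, words = neighbor_counts(sentences)
matrix = cooccurrence_matrix(pairs, ["die", "Geschichte", "der"])
```

In a matrix display, each cell value would be mapped to color opacity; ranking the pairs of one search word by raw count or by `pmi` gives the candidate collocates for a graph.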

4.3. Interaction of visualizations and corpus data
Within the OPATCH corpus search interface, the individual visualization components (as described in section 4.2. above) are integrated with a server-based corpus query infrastructure (see subsection 4.3.1.), which is partly complemented by data transformation scripts adapted to the format requirements of individual visualization components (see subsection 4.3.2.).

4.3.1. The IMS Open Corpus Workbench (CWB) 12 is employed as core infrastructure for the management, storage and retrieval of corpus data in OPATCH. It comes with a powerful query API based on the query language CQP (Corpus Query Processor), which allows queries on word level as well as on annotation layers, and supports the calculation of basic frequency counts. For example, for DoubleTree, word frequencies conditioned on their context are calculated for all distinct word sequences of a query result set. The CWB is installed server-side. The OPATCH corpora are annotated with lemma, Part-of-Speech and Named Entity information, and are processed and indexed for inclusion within the CWB search engine.

4.3.2. Additional data transformations are handled by custom scripts which bridge the gap between CQP query output and the input formats required by specific visualizations. In particular, the timeline visualization Slash/A requires data aggregation according to metadata on years and text source (e.g. distinguishing different newspaper publishers). This massive data model is precomputed and stored on the web server as a JSON file, and loaded into memory once when the OPATCH application starts. With the current version, only a small representative fraction of the 100k OPATCH data sets is loaded into memory. The data transformations are handled by client-side JavaScript and utilities.

4.4. Interlinking of visualizations
Within the visual corpus interface at its current state of maturity, the interlinking between different visualization components is limited to the implicit semantic correspondence of the presented data. More specifically, the interlinking relates to the parallelization of the visualizations with respect to the underlying data. Whenever a search query is submitted by the user, all visualizations are updated in parallel and operate on the same data as returned by that query. In consequence, the query results are processed and displayed according to the nature and data model of each single visualization, resulting in a coordinated set of visualizations that show different aspects of the results of a single search query.

11 https://bost.ocks.org/mike/miserables/
12 http://cwb.sourceforge.net/

5. Open challenges and future work

Challenges regarding the visual corpus interface relate to efficient data retrieval and processing on the one hand, and to consistent interlinking and interactivity of visualization components on the other hand.

The data aspect mainly boils down to scalability questions regarding retrieval tasks and the programmatic handling of client-side vs. server-side processing, as well as the pre-loading or online loading of result sets and the storage of temporary data. With respect to the usability of the corpus interface, efficiency in data handling relates to page loading times, but also to security issues regarding data access and related liabilities concerning copyright (e.g. limited access to full texts). Future work will focus on improving the efficiency and scalability of data processing for the visualizations which require a substantial amount of pre-calculation, in particular Slash/A.

The challenge of interlinking and interactivity of visualization components touches usability questions regarding the transparency of data flows in response to the users’ queries. In addition to these user-centered considerations, interaction design decisions are also linked to technical questions of the data loading and processing times required for recalculating and updating the data sets which feed the visualizations. Future work will address the explicit interlinking of the different visualization components, with the aim to allow for dynamic updates of visualizations through explicit user actions, such as the interactive selection and filtering of data directly within a visualization. For example, the selection of a specific branch in DoubleTree should trigger an update of the KWICis display, such that only hits of the refined subset are displayed.

In very general terms, open challenges also relate to the design of new visualization components for language and textual data as well as the adaptation and enhancement of existing ones. Future work will adapt and integrate further innovative language visualizations into the interface. Furthermore, general usability aspects related to the visual corpus interface as a whole, and in relation to a diversified set of specialized use cases, need to be addressed. Future work aims at evaluating the visual interface by means of a user study on its use for linguistic analysis purposes.
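The explicit interlinking named as future work above, where selecting a DoubleTree branch narrows the KWICis display to the matching hits, amounts to a prefix filter over the contexts of the result set. The Python sketch below illustrates that idea only; it does not describe an existing OPATCH feature, and the `Hit` record, `filter_by_branch` and the sample hits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One concordance line: left context, search word, right context."""
    left: tuple
    keyword: str
    right: tuple


def filter_by_branch(hits, right_branch=(), left_branch=()):
    """Keep hits whose contexts start with the selected branch words.

    Both branches are read outward from the search word, so
    `left_branch` lists the nearest left word first.
    """
    right_branch, left_branch = tuple(right_branch), tuple(left_branch)

    def matches(hit):
        outward_left = tuple(reversed(hit.left))
        return (hit.right[:len(right_branch)] == right_branch
                and outward_left[:len(left_branch)] == left_branch)

    return [hit for hit in hits if matches(hit)]


# Invented result set for the search word "Geschichte".
hits = [
    Hit(("die",), "Geschichte", ("der", "Stadt")),
    Hit(("eine",), "Geschichte", ("des", "Landes")),
    Hit(("die",), "Geschichte", ("der", "Kirche")),
]
after_click = filter_by_branch(hits, right_branch=("der",))
```

A click handler in the tree view would pass the selected branch to such a filter and hand the reduced list to the table view, so that both components keep operating on the same query result.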

Conclusions
In this paper, we presented a visual corpus interface for linguistic analysis purposes. The interface is created in the context of the OPATCH project, which provides online search and analysis facilities for corpora for the study of the South Tyrolean variety of German. The purpose of the interface as well as its overall structure and the individual visualizations have been described in detail, and open challenges as well as plans for future work have been discussed.

Acknowledgements
We would like to thank Chris Culy and Maria Chinkina for technical support in adapting their tools. The project OPATCH is co-financed by the ‘Prov. Auton. di Bolzano-Alto Adige, Diritto allo studio, università e ricerca scientifica, Legge prov. 13 dic. 2006, n. 14’.

References
[1] V. Lyding, M. Généreux, K. Szabò and J. Andresen. The OPATCH corpus platform - facing heterogeneous groups of texts and users. In C. Bosco, F.M. Zanzotto and S. Tonelli (eds.): Proc. of the 2nd Italian Conference on Computational Linguistics, CLiC-it 2015. Trento: Accademia University Press. December 2015.
[2] A. Abel, S. Anstein and S. Petrakis. Die Initiative Korpus Südtirol. In Linguistik Online, 38(2), 2009.
[3] S.K. Card, J.D. MacKinlay and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, 1998.
[4] M.A. Hearst. Search User Interfaces. New York: Cambridge University Press, 1st edition, 2009.
[5] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proc. of the 1996 IEEE Symposium on Visual Languages, VL '96, pages 336-343, Washington: IEEE Computer Society, 1996.
[6] T. Mayer, C. Rohrdantz, M. Butt, F. Plank and D. Keim. Visualizing vowel harmony. In Linguistic issues in language technology, 4(2), 1-33, 2010.
[7] G. Moretti, S. Tonelli, S. Menini and R. Sprugnoli. ALCIDE: An online platform for the Analysis of Language and Content In a Digital Environment. In R. Basili, A. Lenci and B. Magnini (eds.): Proc. of the First Italian Conference on Computational Linguistics, CLiC-it 2014. Pisa, Italy. December 2014.
[8] M. Wattenberg and F.B. Viégas. The word tree, an interactive visual concordance. In IEEE Transactions on Visualization and Computer Graphics, 14(6), 1221-1228. November 2008.
[9] P. Riehmann, H. Gruendl, B. Froehlich, M. Potthast, M. Trenkmann and B. Stein. The netspeak wordgraph: Visualizing keywords in context. In Proc. of the 2011 IEEE Pacific Visualization Symp., PACIFICVIS '11, 123-130, Washington: IEEE Computer Society. 2011.
[10] M. Wattenberg. Arc diagrams: visualizing structure in strings. In IEEE Symposium on Information Visualization, INFOVIS 2002. 110-116. 2002.
[11] W. Cheng, C. Greaves and M. Warren. From n-gram to skipgram to concgram. In Corpus Linguistics, 11(4), 411-433. 2006.
[12] M. Chen, S. Huang, T. Kao, H. Chiu and T. Yen. Glance visualizes lexical phenomena for language learning. In Proc. of the Workshop on Interactive Language Learning, Visualization, and Interfaces. ACL. June 2014.
[13] C. Collins, F.B. Viégas and M. Wattenberg. Parallel tag clouds to explore and analyze faceted text corpora. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2009. 91-98. Oct 2009.
[14] M.A. Hearst. Tilebars: Visualization of term distribution information in full text information access. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, CHI '95. ACM Press/Addison-Wesley Publishing Co. New York, USA. 59-66. 1995.
[15] T. Wilhelm, M. Burghardt and C. Wolff. To See or Not to See - An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In R. Franken-Wendelstorf, E. Lindinger and J. Sieck (eds.), Kultur und Informatik: Visual Worlds & Interactive Spaces. Glückstadt: Verlag Werner Hülsbusch, 175-185. 2013.
[16] J. Zhao, F. Chevalier, C. Collins and R. Balakrishnan. Facilitating Discourse Analysis with Interactive Visualization. In IEEE Transactions on Visualization and Computer Graphics, 18(12), 2639-2648. 2012.
[17] C. Rohrdantz, S. Koch, C. Jochim, G. Heyer, G. Scheuermann, T. Ertl, H. Schütze and D.A. Keim. Visuelle Textanalyse. In Informatik-Spektrum 33, 601-611. 2010.
[18] B. O'Connor. MiTextExplorer: Linked brushing and mutual information for exploratory text data analysis. In Proc. of the Workshop on Interactive Language Learning, Visualization, and Interfaces. ACL. 2014.
[19] A.S. Muralidharan, M.A. Hearst and C. Fan. Wordseer: a knowledge synthesis environment for textual data. In Q. He, A. Iyengar, W. Nejdl, J. Pei and R. Rastogi (eds.), CIKM. ACM. 2533-2536. 2013.
[20] J. Stasko, C. Görg and Z. Liu. Jigsaw: Supporting investigative analysis through interactive visualization. In Information Visualization, 7(2), 118-132, April 2008.
[21] C. Culy and V. Lyding. Corpus Clouds - facilitating text analysis by means of visualizations. In Proc. of the 4th Language & Technology Conference, November 6-8, 2009, Poznan, Poland, 521-525. 2009.
[22] L. Kocincová, V. Baisa, M. Jakubíček and V. Kovář. Interactive visualizations of corpus data in sketch engine. In Proc. of the Workshop on Innovative Corpus Query and Visualization Tools, NODALIDA 2015. Linköping University Electronic Press. 17-22. May 2015.
[23] C. Culy and V. Lyding. Double Tree: An Advanced KWIC Visualization for Expert Users. In Proc. of 14th International Conference on Information Visualization, IV 2010. 98-103. 2010.
[24] V. Todorova and M. Chinkina. Slash/A N-gram Tendency Viewer - Visual Exploration of N-gram Frequencies in Correspondence Corpora. In R. de Haan (ed.), Proc. of the ESSLLI 2014 Student Session, Tübingen, Germany. 229-239. 2014.
