Volume 0 (1981), Number 0 pp. 1–7

COMPUTER GRAPHICS forum

Visualizing Translation Variation of Othello: A Survey of Text Visualization and Analysis Tools

Zhao Geng1, Robert S. Laramee1, Tom Cheesman2, Andy Rothwell2, David M. Berry3, Alison Ehrmann2

1 Visual Computing Group, Computer Science Department, Swansea University, UK, cszg,[email protected]
2 College of Arts and Humanities, Swansea University, UK, T.Cheesman,[email protected], [email protected]
3 Political and Cultural Studies, Swansea University, UK, [email protected]

Abstract
As a global icon, Shakespeare has had his plays translated into dozens of languages over about 300 years. There are also many re-translations into the same language; for example, there are more than 40 translations of Othello into German. Every translation is a different interpretation of the play. This large body of translations reflects changing cultures and expresses the individual thought of their authors. The translations build wide connections between different regions and offer a retrospective view of their histories. At present, researchers in Modern Languages are collecting a large number of translations of William Shakespeare’s play Othello. Since roughly 2005, we have witnessed a rapid increase in the number of off-the-shelf text visualization tools that can benefit this study. Here we set out to use existing text visualization techniques and tools to gain a better understanding of the various translations of Shakespeare’s work. In particular, we would like to learn which content varies highly across translations and which content remains stable. We would also like to form hypotheses about the implications of this variation.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Line and curve generation

© 2011 The Author(s)
Journal compilation © 2011 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

1. Introduction
The goal of this project is to visualize the various translations of Shakespeare’s work, Othello. The initial task is to identify and extract the non-semantic features from the original text of a document corpus. The non-semantic features refer to the number of words, tokens and patterns in the concordance. Text pre-processing facilitates the construction of text concordances, term relations, document relevance and other properties of interest. Based on the extracted information, various visualizations can be applied. In this document, we present the results of our survey of state-of-the-art techniques and free, off-the-shelf tools for text analysis and visualization.

2. Text Preprocessing
WordSmith [Wor96] generates various text attributes, such as word frequencies, parts of speech and other statistical information. Its analysis produces a large amount of statistical data about word frequencies in the texts (both absolute values and values compared with other texts or with external corpora) and key word lists (words which occur unusually frequently in comparison with a reference corpus). A screenshot of the software is shown in Figure 1.

Concordance [Wat09] is designed for people who need in-depth language or text analysis, and a free trial is available. It can generate indexes and word lists, count word frequencies, compare different usages of a word, analyse keywords, find phrases and publish the analysis results on the web. A screenshot of the software is shown in Figure 2.
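To make the notions of word frequency and key words concrete, the following minimal Python sketch counts word frequencies and ranks words that are unusually frequent relative to a reference text. It is an illustration only: the function names, the add-one smoothing and the simple frequency ratio are our assumptions, and neither WordSmith nor Concordance is claimed to compute keyness this way.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens (runs of letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Absolute frequency of every word in the text."""
    return Counter(tokenize(text))


def key_words(target, reference, min_count=2):
    """Rank target words by smoothed relative frequency against a reference text.

    Assumption: a plain frequency ratio with add-one smoothing stands in for
    the statistical keyness tests used by real concordance software.
    """
    t, r = word_frequencies(target), word_frequencies(reference)
    t_total, r_total = sum(t.values()), sum(r.values())
    vocab = len(set(t) | set(r))
    scores = {}
    for word, count in t.items():
        if count < min_count:
            continue
        t_rate = (count + 1) / (t_total + vocab)
        r_rate = (r[word] + 1) / (r_total + vocab)
        scores[word] = t_rate / r_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here `target` could be one translation and `reference` the concatenation of the others; words at the top of the returned list are candidates for content that is specific to that translation.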

D. Fellner & S. Behnke / Visualizing Translation Variation of Othello : A Survey of Text Visualization and Analysis Tools

Figure 1: The interface of WordSmith [Wor96].

Figure 2: The interface of Concordance [Wat09].

3. State-of-the-art Text Visualization
In this section, we survey state-of-the-art text visualization from two perspectives: research prototypes for text visualization and free, off-the-shelf visualization tools. We refer the reader to [Hom11], [RVA04] and [iB10] for lists of freely available visualization software.

3.1. Free, Off-the-shelf Text Visualization Tools
In this section, we investigate text visualization tools which are free to the public. Our work can help modern language experts find the visualizations that are most useful for analysing their collected translations of Shakespeare. An overview of the free, off-the-shelf tools for text visualization is shown in Figure 11. We test these freely available tools on 23 German translations of Othello’s speech to the senate from Shakespeare’s play Othello.

TextArc [Pal02] is a visual representation of an entire text on a single page. It is an advanced combination of an index, concordance, and summary of the text. Animation enables the user to keep track of the changing relationships between different words, phrases and sentences. In TextArc, the entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse, which preserves the typographic structure of the text. Each word is drawn inside the ellipse. A word with high frequency is displayed in a brighter color and a larger size. If a word is used more than once, it appears at the center of all of its mentions. TextArc accepts only texts from the TextArc library. Figure 3 shows the visualization of Shakespeare’s play Othello generated by TextArc.

Figure 3: The TextArc [Pal02] visualization of Shakespeare’s play Othello in English. The entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse, and each word is drawn inside it.
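The placement rule described for TextArc, where a repeated word appears at the center of all of its mentions, can be sketched as follows. This is a minimal illustration under our own assumptions (lines spaced at equal angles, hypothetical ellipse radii `a` and `b`, whitespace tokenization); it is not the actual TextArc layout code.

```python
import math
from collections import defaultdict


def line_positions(num_lines, a=1.0, b=0.6):
    """Place each line of the text at equal angular steps around an ellipse."""
    positions = []
    for i in range(num_lines):
        angle = 2 * math.pi * i / num_lines
        positions.append((a * math.cos(angle), b * math.sin(angle)))
    return positions


def word_positions(lines, a=1.0, b=0.6):
    """Place each word at the mean position of all of its mentions."""
    ring = line_positions(len(lines), a, b)
    mentions = defaultdict(list)
    for (x, y), line in zip(ring, lines):
        for word in line.lower().split():
            mentions[word].append((x, y))
    return {
        word: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for word, pts in mentions.items()
    }
```

A word that occurs only once sits next to its line on the ellipse, whereas a word spread evenly over the whole text is pulled towards the middle.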

NameVoyager [Wat05], a web-based visualization of historical trends in baby naming, has proven remarkably popular. The method used to visualize the data is straightforward: given a set of name popularity time series, a set of stacked graphs is produced. However, this tool does not accept user-customized data sets.

Tagline Generator [Meh06] is a simple PHP codebase that lets the user generate chronological tag clouds from simple text data sources without manually tagging the data entries. Once the user has populated the data source and configured the generator, it creates a list of all the unique words that have been used and counts how many times each word is used. Next, it identifies the different variations of words and combines them under the most common variation using the Porter Stemming Algorithm. The size of a word indicates its frequency in the document. The brightness indicates the year of the document: the newer the document, the brighter the word. Tagline Generator accepts an XML file deployed on the web as input. Figure 4 shows the Tagline visualization of 23 German translations of Shakespeare’s play Othello.

Figure 4: Visualizations of two German translations of Othello using Tagline Generator [Meh06]. By moving the scroll bar, the user is able to see the visualization of each individual document. We tested 23 German translations of Othello with this tool.

ManyEyes [VWvH∗07] is a free website where anyone can upload, visualize, and discuss data. It is an experiment created by the Visual Communication Lab. Input data is supplied to ManyEyes by copying and pasting any form of free text. It provides a number of text visualizations, such as Tag Clouds, Phrase Net and Word Tree. Again, we apply our Othello data, which contains 23 different German translations of the play, to the visualizations in this tool.

The standard Tag Cloud [BGN08] is a popular text visualization for depicting term frequencies. Tags are usually single words and are normally listed alphabetically, and the importance of each tag is shown with font size or color, as shown in Figure 5.

Word Tree [WB08] is a graphical version of the traditional keyword-in-context method, and enables rapid querying and exploration of bodies of text, as shown in Figure 6. It is a visual search tool for unstructured text, such as a book, article, speech or poem. It allows the user to choose a word or phrase and shows all the different contexts in which the word or phrase appears. The contexts are arranged in a tree-like branching structure to reveal recurrent themes and phrases. The size of a word represents its frequency.

Phrase Nets [vHWV09] illustrate the relationships between different words used in a text. They use a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. For example, with the connecting word "and", two words are linked in the network if they appear together in a phrase of the form "X and Y", as shown in Figure 7.

TagCrowd [Ste08] is a web application for visualizing word frequencies in any user-supplied text by creating a tag cloud or text cloud [BGN08]. The advantage of TagCrowd is that the user can define the common words themselves, and these common words are automatically removed from the original text. Figure 8 shows the TagCrowd visualization of our Othello data sets, with the common German words removed.

Wordle [Jon09] is a tool for generating "word clouds" from text that the user provides. Wordles are more artistically arranged (and often vibrantly colored) versions of a text.
They tend to be less directly insightful as information graphics, but often give a more personal feel to a document. The clouds give greater prominence to words that appear more frequently in the source text. The user can tweak their clouds with different fonts, layouts, and color schemes. Figure 9 shows the Wordle visualization of our Othello data sets, with the common German words removed.

TokenX [Zil11], created by Brian Pytlik Zillig, is a powerful text analysis, visualization, and play tool that has been customized for use on the Walt Whitman Archive. The text base for the Archive customization currently includes the six American editions of Leaves of Grass published in Whitman’s lifetime and the deathbed edition of 1891-1892. TokenX currently supports the following features: text highlighting based on patterns in words, keyword in context, replacing words with blocks, word concordances sorted alphabetically or by frequency, word usage statistics, word substitution, user-selected replacement of words with images, and creative exploration. The accepted input format is the same as for Tagline Generator [Meh06]: both accept an XML file on the web. Figure 10 shows two visualizations of our Othello data generated by TokenX.

Figure 6: The Word Tree [WB08] of our Othello data using ManyEyes [VWvH∗07]. When we enter the word "liebte", all of the sentence continuations following this word are shown. The size of a word represents its frequency.
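The keyword-in-context idea behind the Word Tree in Figure 6 can be sketched as a small prefix tree over the words that follow a chosen root word. This is a minimal sketch under our own assumptions (a hypothetical `build_word_tree` function, sentences split on `.`, `!` and `?`, a fixed depth limit); it is not the ManyEyes implementation.

```python
import re


def build_word_tree(text, root, depth=3):
    """Collect the words following each occurrence of `root` into a nested dict.

    Each node maps a word to {"count": n, "children": {...}}; the count is the
    number of contexts passing through that node and could drive its font size.
    """
    root = root.lower()
    tree = {}
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != root:
                continue
            level = tree
            # Walk down the tree along the words that follow this occurrence.
            for word in tokens[i + 1 : i + 1 + depth]:
                node = level.setdefault(word, {"count": 0, "children": {}})
                node["count"] += 1
                level = node["children"]
    return tree
```

Calling `build_word_tree(text, "liebte")` on a set of translations would group all continuations of "liebte", with shared continuations merged into a single branch.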


Figure 5: Tag Clouds [BGN08] of our Othello data set using ManyEyes [VWvH∗07]. The left image depicts the tag cloud for every single word, whereas the right image shows the tag cloud of pairs of words starting with the letter "b". ManyEyes does not provide text preprocessing options for the Tag Cloud, such as removing common words.
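The mapping from term frequency to font size in a tag cloud, together with the optional removal of common words, can be sketched as below. The stop list, the linear size scale and all parameter names are assumptions made for illustration; none of the surveyed tools is claimed to use exactly this scaling.

```python
import re
from collections import Counter

# Illustrative stop list only; a real German stop list would be much longer.
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "ich", "sie", "er", "es", "zu", "in"}


def tag_cloud(text, stop_words=frozenset(), max_tags=50, min_pt=10, max_pt=48):
    """Return alphabetically ordered (word, font size) pairs for the most frequent words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    top = counts.most_common(max_tags)
    if not top:
        return []
    lo = min(c for _, c in top)
    hi = max(c for _, c in top)
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    sized = [(w, round(min_pt + (c - lo) * (max_pt - min_pt) / span)) for w, c in top]
    return sorted(sized)
```

Passing `stop_words=GERMAN_STOP_WORDS` mimics the manual stop lists used with TagCrowd and Wordle, while the default empty set corresponds to a cloud without preprocessing.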

Figure 10: The left image shows the Tag Cloud generated by TokenX. The right image shows the text with the word "Liebte" replaced by a heart shape.

Figure 7: The Phrase Net [vHWV09] of our Othello data using ManyEyes [VWvH∗07]. It depicts any two words in the Othello play that are separated only by a space. The size of a word depicts its frequency.
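The pattern matching that underlies a Phrase Net can be sketched with a regular expression that extracts every phrase of the form "X and Y" and counts the resulting word pairs. The function name and the choice to treat each pair as a directed edge are our assumptions; the sketch covers only edge extraction, not the layout described in [vHWV09].

```python
import re
from collections import Counter


def phrase_net_edges(text, connector="and"):
    """Count directed edges (X, Y) for every phrase of the form 'X <connector> Y'."""
    # The lookahead captures Y without consuming it, so that in
    # "A and B and C" both (A, B) and (B, C) are found.
    pattern = re.compile(
        r"\b([^\W\d_]+)\s+" + re.escape(connector) + r"\s+(?=([^\W\d_]+))",
        re.IGNORECASE,
    )
    return Counter((m.group(1).lower(), m.group(2).lower()) for m in pattern.finditer(text))
```

For the German translations the connector would be "und"; the edge counts could then determine line thickness, and the word frequencies the font size.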

Figure 8: The TagCrowd [Ste08] visualization of our Othello data set. The common German words (the stop list) are manually defined and removed from the original text.

Figure 9: The Wordle visualization [Jon09] of our Othello data sets. The common German words are removed.

3.2. Research Prototypes for Text Visualization
Since 2005, we have observed a rapid increase in the number of text visualization prototypes being developed. As a result, various visual representations of text streams and documents have been proposed to present and explore text features effectively. Using the text preprocessing tools introduced in Section 2, we can collect a wide range of text attributes, such as word relationships, word frequency and sentence segmentation. In this section, we list some interesting and novel text visualizations which are able to present some of the extracted text attributes. The prototypes are listed in chronological order.

The ThemeRiver [HHWN02] visualization depicts thematic variations over time within a large collection of documents, as shown in Figure 12. The thematic changes are shown in the context of a time line and corresponding external events. The focus on temporal thematic change within a context framework allows a user to discern patterns that suggest relationships or trends.
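As a rough illustration of how theme strengths over time can be turned into a river-like layout, the sketch below stacks the themes at each time step and centres the stack on a horizontal axis. The symmetric baseline, the function name and the input format are our assumptions; the smooth interpolation between time steps used by ThemeRiver is not reproduced.

```python
def themeriver_layers(series):
    """Stack theme strengths symmetrically around a horizontal axis.

    `series` maps a theme name to a list of non-negative strengths, one per
    time step. Returns a dict mapping each theme to a list of (lower, upper)
    bounds, one pair per time step.
    """
    themes = list(series)
    steps = len(next(iter(series.values())))
    layers = {theme: [] for theme in themes}
    for t in range(steps):
        total = sum(series[theme][t] for theme in themes)
        lower = -total / 2  # centre the whole river on y = 0
        for theme in themes:
            upper = lower + series[theme][t]
            layers[theme].append((lower, upper))
            lower = upper
    return layers
```

In our setting, a time step could be the publication year of a translation and a theme strength the frequency of a chosen word in that translation.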

Figure 12: ThemeRiver [HHWN02] depicts thematic changes over time in a collection of patents from one company.

A Document Contrast Diagram [Cla08] is a visual summary of the content of two text documents. It illustrates shared words, words that are unique to one document or the other, word frequency, the relative size of the two documents, the distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment. It uses the familiar bubble technique and colour to contrast topic usage in two bodies of text. Figure 13 shows the Document Contrast Diagram for the 2007 and 2008 US State of the Union (SOTU) Addresses.

Figure 13: In this Document Contrast Diagram [Cla08], the column of squares toward the left hand side represents the segments of text from the left document. The topmost square is the first part of the document. Similarly on the right hand side. The larger of the two documents has 50 segments (squares) and the smaller document proportionally fewer.
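The core comparison behind a Document Contrast Diagram, separating shared words from words unique to one document, can be sketched with simple set operations on word counts. The function name and return format are hypothetical, and the segment squares, emotional tone and co-occurrence features of [Cla08] are not covered.

```python
import re
from collections import Counter


def contrast(doc_a, doc_b):
    """Split the vocabulary of two documents into shared and unique words with counts."""
    count_a = Counter(re.findall(r"[^\W\d_]+", doc_a.lower()))
    count_b = Counter(re.findall(r"[^\W\d_]+", doc_b.lower()))
    # Shared words keep both frequencies so they can be compared side by side.
    shared = {w: (count_a[w], count_b[w]) for w in count_a.keys() & count_b.keys()}
    only_a = {w: c for w, c in count_a.items() if w not in count_b}
    only_b = {w: c for w, c in count_b.items() if w not in count_a}
    return shared, only_a, only_b
```

Applied to two translations of the same speech, `only_a` and `only_b` give a first impression of where the translators’ word choices diverge.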

Parallel Tag Clouds [CBW09] combine parallel coordinates and tag clouds to provide a rich overview of a document collection. As shown in Figure 14, each vertical axis represents a category; for example, the categories could be different versions of the Othello translation. The words in each category are summarized in the form of a tag cloud along the vertical axis. When the user clicks on a word, occurrences of the same word on other vertical axes are connected. Several filters can be defined to reduce the amount of text displayed in each category, which frees up screen space and improves the clarity of the visualization.

DocuBurst [CCP09] uses a radial, space-filling layout to depict document content by visualizing structured text. The structured text in this visualization refers to the IS-A relationship. For example, a robin or redbreast is a bird, a bird is an animal, an animal is an organism or living thing, and a living thing is an entity. Such structured text forms a tree hierarchy, with entity as the root and robin or redbreast as a leaf. As shown in Figure 15, the root node of the DocuBurst visualization is drawn as a circle. All other nodes are assigned to a sector of an annulus. The angular width of each sector is mapped to the number of leaves or children.

SparkClouds [LRKC10] integrates sparklines into a tag cloud to convey trends between multiple tag clouds, as shown in Figure 16. The sparklines present the trend over time. A controlled study comparing SparkClouds with two traditional trend visualizations (multiple line graphs and stacked bar charts) as well as Parallel Tag Clouds showed that SparkClouds is more effective at showing trends over time.

ManiWordle [KLKS10] provides flexible control so that the user can directly manipulate the original Wordle to change its layout, colour and other properties, as shown in Figure 17.
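The radial layout described for DocuBurst, in which each node of the IS-A hierarchy receives an annulus sector whose angular width depends on the number of leaves below it, can be sketched as follows. The small hierarchy, the function names and the choice of leaf count (rather than child count) are assumptions for illustration, not the implementation of [CCP09].

```python
import math

# Hypothetical IS-A hierarchy, loosely following the example in the text.
IS_A = {
    "entity": ["living thing", "object"],
    "living thing": ["animal", "plant"],
    "animal": ["bird"],
    "bird": ["robin", "redbreast"],
}


def leaf_count(tree, node):
    """Number of leaves below `node` (a leaf counts as one)."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(tree, child) for child in children)


def sector_angles(tree, node, start=0.0, end=2 * math.pi, depth=0, out=None):
    """Assign each node a (depth, start angle, end angle) proportional to its leaf count."""
    if out is None:
        out = {}
    out[node] = (depth, start, end)
    children = tree.get(node, [])
    total = sum(leaf_count(tree, child) for child in children)
    angle = start
    for child in children:
        width = (end - start) * leaf_count(tree, child) / total
        sector_angles(tree, child, angle, angle + width, depth + 1, out)
        angle += width
    return out
```

The depth gives the ring in which a node is drawn, with depth 0 being the central circle, and the two angles bound its sector within that ring.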

Figure 15: DocuBurst [CCP09] showing a fully expanded tree structure of the IS-A relationship.

Figure 16: SparkClouds [LRKC10] showing the top 25 words from the US Presidential Speeches for the last time point in a series.

References
[BGN08] Bateman S., Gutwin C., Nacenta M.: Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections. In HT ’08: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia (New York, NY, USA, 2008), ACM, pp. 193–202.
[CBW09] Collins C., Viégas F. B., Wattenberg M.: Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora. In IEEE Symposium on Visual Analytics Science and Technology (2009), Computer Society, pp. 91–98.
[CCP09] Collins C., Carpendale M. S. T., Penn G.: DocuBurst: Visualizing Document Content using Language Structure. Computer Graphics Forum 28, 3 (2009), 1039–1046.
[Cla08] Clark J.: Document contrast diagrams, 2008. http://neoformix.com/2008/DocumentContrastDiagrams.html, Last Access Date: 2011-2-18.
[HHWN02] Havre S., Hetzler E., Whitney P., Nowell L.: ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE Transactions on Visualization and Computer Graphics 8, 1 (2002), 9–20.
[Hom11] Home K.: Visualization Software, Feb 2011. http://www.kdnuggets.com/software/visualization.html, Last Access Date: 2011-2-18.
[iB10] Šilić A., Bašić B. D.: Visualization of Text Streams: A Survey. Knowledge-Based and Intelligent Information and Engineering Systems 6277, 6 (2010), 31–43.
[Jon09] Feinberg J.: Wordle: Beautiful Word Clouds, 2009. http://www.wordle.net/, Last Access Date: 2011-2-18.
[KLKS10] Koh K., Lee B., Kim B. H., Seo J.: ManiWordle: Providing Flexible Control over Wordle. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1190–1197.
[LRKC10] Lee B., Riche N. H., Karlson A. K., Carpendale M. S. T.: SparkClouds: Visualizing Trends in Tag Clouds. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1182–1189.
[Meh06] Mehta C.: Tagline Generator - Timeline-based Tag Clouds, 2006. http://chir.ag/projects/tagline/, Last Access Date: 2011-2-18.
[Pal02] Paley W. B.: TextArc: An Alternative Way to View Text, 2002. http://www.textarc.org/, Last Access Date: 2011-2-18.
[RVA04] Rajman M., Vesely M., Andrews P.: State of the Art, Evaluation and Recommendations Regarding Document Processing and Visualization Techniques, 2004. http://arxiv.org/abs/cs/0412114, Last Access Date: 2011-2-18.
[Ste08] Steinbock D.: TagCrowd: Joining the Crowd Together, 2008. http://tagcrowd.com/, Last Access Date: 2011-2-18.
[vHWV09] van Ham F., Wattenberg M., Viégas F. B.: Mapping Text with Phrase Nets. IEEE Transactions on Visualization and Computer Graphics 15, 6 (2009), 1169–1176.
[VWvH∗07] Viégas F. B., Wattenberg M., van Ham F., Kriss J., McKeon M.: ManyEyes: A Site for Visualization at Internet Scale. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1121–1128.
[Wat05] Wattenberg M.: Baby Names, Visualization, and Social Data Analysis. In Proceedings of 2005 IEEE Symposium on Information Visualization (INFOVIS) (2005), pp. 1–6.
[Wat09] Watt R. J. C.: Concordance 3.3, July 2009. http://www.concordancesoftware.co.uk/, Last Access Date: 2011-2-18.
[WB08] Wattenberg M., Viégas F. B.: The Word Tree, an Interactive Visual Concordance. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1221–1228.
[Wor96] WordSmith.org: WordSmith Tools, 1996. http://www.lexically.net/wordsmith/index.html, Last Access Date: 2011-3-16.
[Zil11] Zillig B. P.: TokenX: a text visualization, analysis, and play tool, 2011. http://segonku.unl.edu/cocoon/tokenxcather/index.html?file=../xml/base.xml, Last Access Date: 2011-2-18.


Figure 11: From left to right, top to bottom: TokenX [Zil11], TagCrowd [Ste08], TextArc [Pal02], NameVoyager [Wat05], Tagline Generator [Meh06], ManyEyes [VWvH∗07] and Wordle [Jon09].


Figure 14: A parallel tag cloud [CBW09] revealing the differences in drug prevalence amongst the circuits.

Figure 17: The final layouts produced using ManiWordle [KLKS10] (left) and the original Wordle visualization by a user. The text is a Wikipedia entry on YU-Na Kim.
