4th Tensions of Europe Plenary Conference

4th Tensions of Europe Plenary Conference & Closing ESF Inventing Europe Conference Sofia University, Bulgaria June 17-20, 2010

Research Collaboration Session

Title: Researching the History of Technology in Europe, Making Contact with European Research in Historical Informatics: A Conversation between the TOE/IE and the ‘Papyrus’ Research Agendas

Session Organizers:
Aristotle Tympas and Yannis Ioannidis, National and Kapodistrian Univ. of Athens, Greece
Jan Korsten, Foundation for the History of Technology, the Netherlands
Yannis Velegrakis, University of Trento, Italy
Tim Koch, Deutsche Welle, Germany
Nikos Sarris, Athens Technology Center, Greece

Contact Address: Aristotle Tympas: [email protected]

Session Description: From research to writing, all aspects of historical work are greatly affected by the use of computing and related technologies. This is reflected in the emergence and rapid growth of interdisciplinary fields at the intersection of computing and the humanities/social sciences, which go by names like ‘Historical Informatics’, ‘Humanities Computing’, ‘Digital Humanities’ and ‘History and Computing’. In this session we examine the relationship between the history of technology and historical informatics in connection with changing research practices, especially changes in the way a new generation of historians of technology (most importantly doctoral students) engages in primary research. More specifically, the session is designed to communicate to the TOE/IE community the outcome of ‘Papyrus’, a European Union research project that uses recent European history of technology as a reference for experimentation with advanced historical informatics research platforms. The emphasis on supporting research on multilingual digitized archives of leading European news agencies (DW, AFP) makes the ‘Papyrus’ orientation all the more relevant to TOE/IE researchers, interested as they are in transnational history that may benefit from research on archives available in more than one European language. Two of the institutions involved in TOE/IE have been invited by a multinational team of informatics researchers to participate in the design and evaluation of ‘Papyrus’: the Division of History of Science and Technology at the Department of Philosophy and History of Science, National and Kapodistrian University of Athens, Greece, has participated in the design of ‘Papyrus’, whereas the Dutch Foundation for the History of Technology is leading its evaluation.

Session Chair: Helmuth Trischler, University of Munich and Deutsches Museum

Session Commentator: Vincent Lagendijk, Technical University of Eindhoven

First Paper Title: A Collective Experiment in Historical Research on the Public Image of Technology: Opportunities and Challenges. All the way from retrieval to storage of material from digitized newspaper and magazine archives

Authors: Spyros Tzokas, Katerina Vlantoni, Theodore Lekkas and Hara Konsta

Affiliation of authors:
Spyros Tzokas, Doctoral Student, Graduate Program in the History of Science and Technology, National and Kapodistrian University of Athens and National Technical University of Athens, Greece
Katerina Vlantoni, Doctoral Student, Division of History of Science and Technology, Department of Philosophy and History of Science, National and Kapodistrian University of Athens, Greece
Theodore Lekkas, Doctoral Student, Graduate Program in the History of Science and Technology, National and Kapodistrian University of Athens and National Technical University of Athens, Greece
Hara Konsta, Doctoral Student, Graduate Program in the History of Science and Technology, National and Kapodistrian University of Athens and National Technical University of Athens, Greece

Abstract: A new generation of historians of technology and science is facing a new environment with regard to the form in which primary sources become available. When a massive digitized archive of Greek newspapers and journals covering the whole of the ‘long nineteenth century’ became publicly available, we started experimenting with appropriating it in our research. The paper will report on how we have been reconfiguring the tool that supported research on these archives so as to add depth and breadth to our research. This resulted in increasing attention to the ‘public image of technology’, as formed through media presentations from the nineteenth century onwards. Our paper introduces three examples, ranging from changes in research on the history of the emergence of powerful engineering discourses (Tzokas) to changes in research on technological risks (Vlantoni) and on gender-technology co-shaping (Konsta). An appendix introduces a control case, which concerns research on the history of software configuration in use (Lekkas). In this case no relevant digitized archives have been used.

1. Introduction

The daily and periodical press constitutes an important primary source for multidimensional research on history in general and the history of science and technology in particular. 1 Technical and scientific periodicals, engineering journals, general periodicals and newspapers have contributed decisively to the circulation of technical and scientific knowledge by translating, publishing and popularizing technical and scientific ideas, practices and other technology/science-related initiatives. 2 In Greece, starting in the late 19th century, newspapers became a dominant medium for the circulation of scientific and technical knowledge to broader audiences. 3 This paper describes the first phase of a project that seeks to explore how a new generation of historians of technology and science in Greece accesses primary source materials in the digital age. The project builds on the experience of the Hellinomnimon Project (1995-2002), which was implemented at the Laboratory for the Electronic Processing of Historical Archives of the Department of History and Philosophy of Science of the University of Athens. Hellinomnimon aimed at supporting research and teaching through the development and application of modern technologies for photographing, cataloguing, archiving and digital image processing in historical archives and digital libraries. 4 This paper is also part of a second ongoing research project at the Department of History and Philosophy of Science at the University of Athens, which is concerned with the public image of science and technology in Greece from the 19th to the early 21st century. Recent reports and publications from this project argue that the use of the daily press as the main archival material can lead to the articulation of new historiographical questions regarding the study of the history of science and

1 See indicatively: A. Jones, “The many uses of newspapers”, Technical report for IMLS project “The Richmond Daily Dispatch”, (2005), http://dlxs.richmond.edu/d/ddr/docs/papers/usesofnewspapers.pdf, [viewed 10-04-2010].
2 See indicatively: Aristotle Tympas, ‘Methods in the History of Technology’, in Encyclopedia of 20th Century Technology, Colin Hempstead (editor), Routledge, (2005), 485-489, F. Papanelopoulou, A. Nieto-Galan, Enrique Perdiguero (eds), Popularizing Science and Technology in the European Periphery, 1800-2000, Ashgate: Aldershot, (2009), F. Papanelopoulou and P. Kjærgaard, ‘Making the paper: Science and Technology in Spanish, Greek and Danish Newspapers around 1900’, Centaurus, 51, 2, (2009), 89-96.
3 E. Mergoupi-Savaidou, F. Papanelopoulou and S. Tzokas, ‘Methodological and historiographical reflections on the use of newspapers in the History of Science: The Greek case, 1900-1910’, in Arne Schirrmacher (ed.), Communicating Science in 20th Century Europe. A Survey on Research and Comparative Perspectives, Preprint Series 385, Max Planck Institute for the History of Science, Berlin, (2010), 9-26 and E. Mergoupi-Savaidou, F. Papanelopoulou and S. Tzokas, ‘Science in Greek newspapers, 1900-1910. Historiographical reflections and the role of journalists for public education’, Science and Education, (submitted), Spyros Tzokas, ‘Technical controversies during the crucial summer of 1899: The public image of the scientist-engineer’, in M. Assimakopoulos et al (eds.), Conference proceedings for the 170 years of the National Technical University of Athens: Engineers and Technology in Greece, Athens, (2010 forthcoming), (in Greek).
4 Such a collection of documents will be decisive in the study of the issues related to the introduction of the new sciences to the Greek-speaking world during the 17th and 18th centuries. See: http://www.iono.noa.gr/hellinomnimon/index.html [viewed 10-04-2010].


technology. These reports and publications have interpreted the content of Greek newspapers in the light of a booming historiography on the popularization of science and technology. A pilot study of the full contents of newspapers covering a period of three years (1908-1910) has shown that the journalistic discourse on science and technology fed back into the shaping of the scientific and technological phenomenon. 5 The research for this project has been based exclusively on the Digital Newspapers Collection of the National Library of Greece. For the needs of the present paper, we use as examples some issues that emerged during the study of the history of technology in Greece. This study covers key episodes in the long period from the late 19th century to the first decade of the 21st century. The examples refer to the history of the emergence of powerful engineering discourses (Tzokas), to changes in research on technological risks (Vlantoni), to gender-technology co-shaping (Konsta), and to processes of software configuration in use (Lekkas). Through the appropriate and efficient use of digitized archives, we wanted to address historiographical issues relevant to phenomena such as the appropriation of technology and science in ‘peripheral’ European locations, the constitution of scientific and technical communities, scientific and technical popularisation, and public discourses on science and technology. 6 While librarians and archivists have paid considerable attention to the challenge posed by the availability of digitized archives of newspapers and journals, historians have yet to catch up. 7 Accumulated work from Historical Informatics and Humanities

5 E. Mergoupi-Savaidou, F. Papanelopoulou and Spyros Tzokas, “The Public Image(s) of Science and Technology in the Greek Daily Press, 1908-1910”, Centaurus, 51, 2, (2009), 116-142.
6 For these historiographical issues see, indicatively: Aristotle Tympas, “What Have Been Since We Have Been Modern? A Macro-Historical Periodization based on Historiographical Considerations on the History of Technology in Ancient and Modern Greece”, ICON: Journal of the International Committee for the History of Technology, (2003), 76-106, T. Misa and J. Schot, ‘Inventing Europe: technology and the hidden integration of Europe’, History and Technology, 21 (2005), 1-19, K. Gavroglu et al, “Science and Technology in the European Periphery. Historiographical Reflections”, History of Science 42, (2008), 153-175 and Edward J. Hackett et al (eds), The Handbook of Science and Technology Studies, MIT Press, (2008).
7 Digitized archives of newspapers and journals: collection documentation converted into a digital collection database. See F. Cameron and H. Robinson, “Digital Knowledgescapes: Cultural, Theoretical, Practical and Usage Issues Facing Museum Collection Databases in a Digital Epoch”, in F. Cameron and S. Kenderdine (eds.), Theorizing Digital Cultural Heritage: A Critical Discourse, MIT Press, (2007), 165-191. For the experience on the development of the digitized collections of newspapers and journals see indicatively: W. Y. Arms, Digital Libraries, MIT Press, (2001), J. Dilevko and L. Gottlieb, “Print Sources in an Electronic Age: A Vital Part of the Research Process for Undergraduate Students”, Journal of Academic Librarianship, 28, 6, (2002), 381-392, J. Gilboe, “The challenge of digitization”, The Serials Librarian, 49 (1/2) (2005), 155-163, R.B. Allen et al, “A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers”, Lecture Notes in Computer Science, 4558, (2007), 235-244, Marilyn Deegan and Kathryn Sutherland (eds), Text Editing: Print and the Digital World, Ashgate, (2009), C. Flavian Blanco and R. Gurrea Sarasa, “Online Journalistic Services: Are Digital Newspapers Complementary to Traditional Press?”, in Ada Scupola (ed.), Cases on Managing E-Services, IGI Global, (2009), 60-74, Yin-Leng Theng et al (eds), Handbook of Research on Digital Libraries: Design, Development, and Impact, Information Science References, (2009), A. M. Ronchi, eCulture: Cultural Content in the Digital Age, Springer, (2009).


Computing has pointed to the challenge that the transition from physically remote archives and libraries to online ones poses to the historian. 8 It is generally assumed that the search systems of digital collections, such as the collections of digitized newspapers and journals, are easy to handle and user friendly. The assumption of those coming to these fields from the informatics community is that the availability of primary sources in digital form provides quick and accurate results, and that navigating is easy, efficient and expeditious. Is this assumption compatible with the experience of a professional historian (in our case, a doctoral student who majors in the history of technology and science) who approaches digital primary sources with specific research questions in mind? 9 As we argue over the course of this paper, efficient research on digitized archives of newspapers and periodicals requires special training, skills and techniques. While researching digitized archives, the historian faces several historiographical and methodological issues, which depend (among other things) on the type of the digitized primary material (newspaper article, journal article). 10 In the following paragraphs,

For the Greek experience on the development of the digitized archives of newspapers and journals see indicatively: Efstathios Amanatidis et al, “Development of Digital Collections at the Aristotle University of Thessaloniki”, 13th Panhellenic Conference of Academic Libraries, Ionian University, http://ionio.gr/libconf/pdfs/Sitaspsifiopoiisi-Aristoteleio.telko.pdf, (13-15 October 2004), [viewed 10-04-2010], (in Greek), D. Gavrilis et al, Kosmopolis: creating digital content in the Greek language, (2004), http://eprints.rclis.org/archive/00005572/01/2004_Kosmopolis.pdf [viewed 10-04-2010], (in Greek), Eleni Mamma, “Development and Management of Digital Newspaper Collections”, 15th Panhellenic Conference of Academic Libraries, University of Patras, (1-3 November 2006), http://conference.lis.upatras.gr/files/1.04.FullText.pdf, [viewed 10-04-2010], (in Greek).
8 See indicatively: D.A. Trinkle (ed.), Writing, Teaching, and Researching History in the Electronic Age: Historians and Computers, M.E. Sharpe, (1998), S.R. Graham, “Historians and Electronic Resources: A Citation Analysis”, Journal of the Association for History and Computing, III, 3, (2000), 1-5, L. J. McCrank, Historical Information Science: An Emerging Unidiscipline, Information Today, Inc, (2001), Helen Tibbo, “Primarily history: historians and the search for primary source materials”, International Conference on Digital Libraries, Proceedings of the 2nd ACM/IEEE-CS, (2002), 1-10, J. L. Gaddis, The Landscape of History: How Historians Map the Past, Oxford University Press, (2002), M. Stieg Dalton and L. Charnigo, “Historians and Their Information Sources”, College & Research Libraries, 65, 5, (2004), 400-425, D. Cohen and R. Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past, (2005), http://chnm.gmu.edu/digitalhistory/index.php, [viewed 10-04-2010], J. W. Cortada, “Do We Live in the Information Age?: Insights from Historiographical Methods”, Historical Methods: A Journal of Quantitative and Interdisciplinary History, 40, 3, (2007), 107-116, F. Cameron and S. Kenderdine (eds.), Theorizing Digital Cultural Heritage: A Critical Discourse, MIT Press, (2007), J. Herbert and K. Estlund, “Creating Citizen Historians”, Western Historical Quarterly, 39, (2008), 333-341, K. Owings Swan and M. Hofer, “The Historical Scene Investigation (HSI) Project: Facilitating Historical Thinking with Web-Based, Digital Primary Source Documents”, Journal of the Association for History and Computing, XI, 1, (2008), http://hdl.handle.net/2027/spo.3310410.0011.101, [viewed 10-04-2010], M. Dobson and B. Ziemann, Reading Primary Sources: The Interpretation of Texts from 19th and 20th Century History, Routledge, (2009).
9 R. Delgadillo and B.P. Lynch, “Future historians: Their quest for information”, College and Research Libraries, 60, (1999), 245-259.
10 J. L. Gaddis, The Landscape of History: How Historians Map the Past, Oxford University Press, (2002), M. Stieg Dalton and L. Charnigo, “Historians and Their Information Sources”, College & Research Libraries, 65, 5, (2004), 400-425, J. W. Cortada, “Do We Live in the Information Age?:


we introduce four research experiences with digitized archives of newspapers and periodicals. Common to all four cases has been an interest in an ongoing recontextualization of dissertation research, based on the possibility of efficient access to digitally available material.

2. Redefining the Public Role of Greek Engineers by the Introduction of a New Technology (Spyros Tzokas)

The history of engineering has been studied in depth and from different perspectives. The historiography of technology and engineering has suggested the key role of technical journals in the study of technical and scientific controversies. 11 The history of engineering controversies in particular has benefited considerably from the study of different types of periodicals from the perspective of the history of the constitution of scientific and technical communities and the history of the popularization of scientific and technical knowledge. Scholarly communities from other disciplines (Sociology of Knowledge, Technical Communication) have also dealt with the systematic analysis of the public discourses of engineers, as expressed in technical journals. 12 Yet the crucial role of technical journals and of the daily press in the establishment of engineering communities has yet to receive adequate attention. 13 My dissertation focuses on the constitution of the professional community of Greek engineers and their institutions (including the first engineering journals), the technical debates between engineers, and the formation of ideologies and cultures concerning the public images of science and technology in Greece during the last decades of the nineteenth and the early decades of the twentieth century. The theme and the context of this doctoral work depend greatly on the study of newspapers, journals and magazines.

My pilot research on this dissertation topic was based on reading the full content of newspapers for a short period, a slice of the fin de siècle years (1899-1901). Based on this pilot research, I came to acknowledge the public presence of engineers in Greek society. In order to adequately address the questions that I had formulated during this pilot research, I wanted to extend my research so as to cover several decades rather than only three years. The archival material (newspapers and technical journals) I had selected to study was accessible only through the use of

Insights from Historiographical Methods”, Historical Methods: A Journal of Quantitative and Interdisciplinary History, 40, 3, (2007), 107-116.
11 See indicatively: Telis Tympas, “On the spontaneous history of engineers and its … history”, Pyrforos, 7, (2003), 112-114, (in Greek).
12 For this historiographical discussion see: E. Mergoupi-Savaidou, F. Papanelopoulou and Spyros Tzokas, “The Public Image(s) of Science and Technology in the Greek Daily Press, 1908-1910”, Centaurus, 51, 2, (2009), 116-142.
13 For this discussion see: Spyros Tzokas, “Engineering journals and communities in Greece: The period before the institution of the Technical Chamber of Greece”, Neusis: Journal for the History and Philosophy of Science and Technology, 18, (2010), (in Greek) and Spyros Tzokas, ‘Technical controversies during the crucial summer of 1899: The public image of the scientist-engineer’, in M. Assimakopoulos et al (eds.), Conference proceedings for the 170 years of the National Technical University of Athens: Engineers and Technology in Greece, Athens, (2010 forthcoming), (in Greek).


microfilm machines. Using microfilm technology is a bit faster than reading the physical printed material, but it is not as efficient as my research requires. This situation changed with the availability of the search tools provided by the Digitized Archive of Newspapers of the National Library of Greece and the recent online posting of the massive Digitized Archive of Newspapers and Periodicals of the Library of the Greek Parliament.

The Digital Newspapers Collection, e-φημερίς [efimeris], is an ongoing project of the National Library of Greece. The collection contains 220,006 digitized pages of the following historical Athenian newspapers: Σκριπ [Skrip] (1893-1911), Ελευθερία [Eleftheria] (1944-1967), Εμπρός [Embros] (1896-1969), Ριζοσπάστης [Rizospastis] (1917-1983) and Ταχυδρόμος (της Αιγύπτου) [Tachydromos (a Greek newspaper published in Egypt)] (1958-1977). The user can retrieve the digitised daily press by the main title of the newspaper or chronologically. Every page of the newspaper is offered in Portable Document Format (PDF). Moreover, the pages have been processed with Optical Character Recognition (OCR), so that the user can search them by any keyword or key phrase. The National Library of Greece offers its digital collections free to all users worldwide. Private copying is allowed, but any other usage (for instance commercial reproduction, public reproduction or republication) is prohibited. 14

The Digitized Archive of Newspapers and Periodicals of the Library of the Greek Parliament contains numerous titles of newspapers and periodicals from the late 18th century to the 21st century. Every page is also offered in PDF. However, full-text search of the content of the documents through OCR is not provided. This digitized archive offers only the possibility to browse the issues by year of publication/volume, by the main title of the newspaper or periodical, and by the name of the author of the article. The user has access to the second most complete Greek collection of newspapers and periodicals (the most complete being that of the National Library of Greece) but cannot make efficient use of it. 15

To illustrate the challenge I have been facing, I take the example of my research on the introduction of the new technology of reinforced concrete in Greek society around 1900. 16 Having studied the relevant engineering publications in the past, I

14 National Library of Greece, Digital Library of Newspapers and Periodical Press (e-φημερίς [efimeris]), (2004), http://www.nlg.gr/dlefimerides.htm [viewed 10-04-2010], (in Greek), Alexandros Koulouris, Access and dissemination policies for the digital content of libraries, Ionian University: Department of Archive and Library Science, Doctoral Dissertation, (2007), http://thesis.ekt.gr/thesisBookReader/id/15039#page/1/mode/2up, [viewed 10-04-2010], 120-124, (in Greek).
15 Library of the Greek Parliament, Digital Library, (2010), http://catalog.parliament.gr [viewed 10-04-2010], (in Greek).
16 The results of this research were used in a presentation at the SHOT conference; see: Spyros Tzokas, “Modern concretes as ancient marbles? The introduction of reinforced concrete in Greece by Elias J. Angelopoulos”, Session: A Concrete Mediterranean Engineering; Cement and Society in the European


tried to expand my research by using many titles of Athenian and Piraeus newspapers from the years between 1890 and 1920. I did so based on the search tools of the digitized archives. I selected the newspaper titles Embros and Skrip, which are available through the digitized archive of the National Library of Greece. From the digitized archive of the Library of the Greek Parliament I chose the newspaper titles Παλιγγενεσία [Paligenesia], Ακρόπολις [Akropolis], Άστυ [Asty], Νέα Εφημερίς [Nea Efimeris], Αιών [Aion], Πρωία [Proia], Εστία [Estia], Καιροί [Kairoi] and Σφαίρα [Sfaira], and the engineering journal Αρχιμήδης [Archimides]. The main problem I faced had to do with variation in the terms used to describe a new technology. Research on many different sources suggested to me that acknowledging this variation was very important. Moreover, I had to take into account the variation of terms in three languages: Greek, French and English. Therefore, I had to include searches with keywords and key phrases, or even combinations of them, that could describe the new technology. As the research progressed, I kept adding new words that I found in the content of the historical articles. For example, searches were done with terms that all mean exactly the same thing: “beton arme”, “μπετόν αρμέ” (beton arme), “reinforced concrete”, “οπλισμένο σκυρόδεμα” (oplismeno skyrodema), “σιδηροπαγές σκυροκονίαμα” (sidiropages skyrokoniama), “σιδηροπαγές τσιμέντο” (sidiropages tsimento), “σιδηροπαγές κονίαμα” (sidiropages koniama), “οπλισμένο τσιμέντο” (oplismeno tsimento). I also began searching with keywords in different combinations, in order to capture dimensions of the public and social character of this technology in the Greek language.
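
The bookkeeping of term variants described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes that the OCR text of a page has been saved locally as plain text, and the names `VARIANTS`, `normalize` and `matching_variants` are hypothetical helpers, not part of any archive's search interface. Case and accents are stripped before comparison, so that upper-case headlines and accented spellings of the same term still match.

```python
import unicodedata

# Term variants for the new material, as listed in the text above.
# The list is meant to grow as new wordings turn up in the sources.
VARIANTS = [
    "beton arme",
    "μπετόν αρμέ",
    "reinforced concrete",
    "οπλισμένο σκυρόδεμα",
    "σιδηροπαγές σκυροκονίαμα",
    "σιδηροπαγές τσιμέντο",
    "σιδηροπαγές κονίαμα",
    "οπλισμένο τσιμέντο",
]


def normalize(text):
    """Fold case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split())


def matching_variants(page_text, variants=VARIANTS):
    """Return the variants that occur in the page text, in list order."""
    haystack = normalize(page_text)
    return [v for v in variants if normalize(v) in haystack]


if __name__ == "__main__":
    # Hypothetical headline: upper-case Greek plus the accented French term.
    sample = "σιδηροπαγές τσιμέντο".upper() + " (béton armé)"
    print(matching_variants(sample))
```

A plain substring test of this kind cannot compensate for OCR misreadings, so it complements close reading of the pages rather than replacing it.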
The list of keywords that I use includes the following: modern AND reinforced concrete, modern city AND reinforced concrete, progress AND reinforced concrete, utopia AND reinforced concrete, Europe AND reinforced concrete, European standards AND reinforced concrete, engineers AND reinforced concrete, engineering textbooks AND reinforced concrete, public lectures AND reinforced concrete, citizens AND reinforced concrete, success AND reinforced concrete, culture AND reinforced concrete etc. 17 In many cases the results were disappointing because the functionality of the OCR system was unstable. Yet in other cases the results were crucial for the improvement of my research. The efficient access to digitized material helped me to advance with regard to several research questions. First, it confirmed that the journalists had many expectations regarding the modernisation of Greek cities through progress in private and public reinforced concrete works constructed by engineers. The new material was presented as suitable for the construction of a modern city, for multi-storied structures and public works resistant to the forces of nature, for the most modern aqueducts, for the lengthiest of ducts, for the biggest funnels, for the most durable tanks

Periphery (20th and 21st centuries), Society for the History of Technology (SHOT): 50th Annual Meeting, Portugal, (Lisbon, 10-14 October 2008).
17 For brevity, I present all the key phrases here in English.


etc. The newly established professional community of Greek engineers sought to solidify its scientific status and its public role in society through the journalistic discourses on reinforced concrete. The Greek engineers used the press to argue that reinforced concrete had distinct characteristics, uniquely capable of pushing the modernization of Greek society. The rhetoric of engineers and journalists was full of enthusiasm for the new material. It contained an optimism regarding the development of reinforced concrete by industries and companies. The assumption was that concrete would allow Greece to catch up with European technical standards and overall technical progress. Through the use of these digitized archives, two realizations changed my initial understanding of the introduction of reinforced concrete technology and of the public role of engineers in appropriating this technology into Greek society through newspapers. First, quite simply, this research brought to light new information. According to the other sources that I had studied, the public promotion of the ‘wonderful’ qualities of European reinforced concrete in Greece dated back to 1902. Research on the digitized archives quickly revealed that it had actually begun some years earlier, in 1897-1898. Second, I quickly realized that a special terminology for the description of the new material was introduced only when the massive usage of reinforced concrete in Greek society became possible. This took place around 1913-1915. Until then, engineers with different educational backgrounds, journalists with superficial knowledge of the new technology, and representatives of construction companies who promoted reinforced concrete used different words to describe it.

3. Changes in Research on Technological Risks (Katerina Vlantoni)

My PhD project is focused on conceptions of “technological accident” and “technological risk”, approached through specific case studies. While looking for interesting cases to study, I decided to concentrate on the recent period (from 1980 onwards). One of my key research questions has to do with the public image of science and technology in cases where it is obvious that ‘risks’ and ‘vulnerability’ are at stake. Currently I am studying the notion of risk in comparison to safety in blood transfusion. The issue of safety is considered a very important one, especially since blood transfusions have led to incidents of transfusion-transmitted diseases worldwide. I am interested in investigating the debate on risk and safety as it was manifested in the introduction of a new technology of blood screening, beginning from the international literature and focusing on the Greek case. Moreover, I am trying to examine the relation between the available technologies and the incidence of transfusion-transmitted diseases in the public debate that surrounded those changes in blood transfusion in Greece. I am doing so by focusing on three domains: the domain of health practitioners, the domain of health policy-makers and the public domain.


During my recent research on the theme of ‘blood safety’ I began by gathering relevant secondary sources. After reviewing them, I started looking for relevant primary sources. Some of the material I wanted to gather was available in a digitized form. The availability of digital material (digitized newspapers and journals) has been really helpful during my attempt to gather primary sources. First of all, it is easier and faster to access a digital archive/library (in terms of space you do not need to commute, in terms of time you can access an archive any time of the day). Therefore, one has the opportunity to access more material. In addition, storing digitized material makes it easier to consult it in the future. Moreover, the digital archives offer the possibility to classify the material in the best possible way according to the needs of the research. Following, you may find some examples from my research. While searching for available sources on the domain of health practitioners I came across the journal “Heama”, published by the Hellenic Association of Haematology. This journal was published from 1998 until 2008 (a total of 46 issues, and 2 supplements) and is available in a digital form on the website of the Association (access is free). 18 This electronic archive is of a very simple design since it does not provide ‘search’ functionality. It offers the possibility to browse all the issues by the year of publication/volume. For some issues you can browse the articles and download in PDF format those desired. In some other issues you can get the full issue (again in PDF format). Therefore, this simple functionality of this archive does not permit any sophisticated search. A ‘search’ functionality could have saved me time while trying to locate relevant articles about blood safety. 
While gathering sources in order to study the public image of technology, in this case the risks associated with technologies for blood screening and the introduction of the new technique of NAT, I undertook multiple searches on Greek newspapers, using the various tools that the newspaper digital archives offer. 19 The online archives of the daily newspapers Καθημερινή [Kathimerini] and Ελευθεροτυπία [Eleftherotypia] have been available in digital form since 2001. This tool is open to all users. It allows searching by keywords, author name, dates, and category of news/publication. Additionally, it offers the possibility to browse the newspapers by specific dates and retrieve a full issue. In these newspapers you can navigate to the electronic format of issues, and there browse by topic to find an article of interest. The system allows you to print an article but not to save it in digital form, and you can print only the text on a plain page, not as it appeared in the printed newspaper. This type of digitized newspaper archive has certain limitations. The most important is the time span: you can search only from 2001 onwards (this coincides more or less with the period when the newspapers started having web pages). Another limitation is that the system does not allow you to sort the results in the desired way (for example by date or by relevance).

In order to search two other popular Greek newspapers, Τα Νέα [Ta Nea] and Το Βήμα [To Vima] (which belong to the same newspaper and magazine publishing company, namely Lambrakis Press), I used a different digitized archive, the DOL Historical Archive. This archive is available to all users for a subscription fee (monthly or annual). It supports online access to a database that consists of four newspapers and one magazine. It is designed to offer a wide range of navigational options, from selecting the publication one wishes to search to performing a search in multiple ways: browsing full issues by date, searching by keywords in the title or the full text of the articles, restricting the search to a time period or a specific issue, or searching by author name. The system allows sorting the results of a search either by date or by relevance. The selected issues can be downloaded page by page in digital format (PDF), laid out as in the printed version, and saved on the user’s computer. 20

I began searching by using keywords in different combinations, like “blood AND safety”, “blood AND molecular diagnostics”, “blood safety AND molecular”, without asking for the exact words at the beginning, a choice that is also due to the grammatical case of nouns in the Greek language. My main concern while using these archives was to narrow the results to the most relevant. This was possible in the DOL Historical Archive because you can specify in your search that the words must be neighbouring in the text. For example, searching the newspaper “Ta Nea” for the keywords ‘blood AND safety’ from 1980 until 2006 returned 934 results. The archive makes it possible to retrieve only the first 500 of them (sorted automatically by relevance; after the results are displayed I can also sort them by date). This is a serious limitation of this newspaper archive for a historian, because it is not possible to process all the results. For the same time period, I repeated the search with the option that requires the same keywords to be neighbouring in the text, which returned 16 results. While processing these results it became obvious that, although many results irrelevant to my research had been removed, some relevant ones did not appear in the new search. At the same time I performed a search using the keywords ‘accident AND blood’, from which I retrieved only a few of the relevant results. Likewise, when I searched the newspaper with the keywords ‘risk AND blood’ I did not retrieve all the relevant results.

My research is still in progress. My attempts to gather primary sources from the Greek newspapers about the public image of technology on the topic of blood transfusion have made me reconsider some points regarding my initial set of research questions. One important issue I faced has to do with the terminology I have used to retrieve material from the secondary literature and the primary sources regarding the risks and safety in blood screening and the introduction of the new molecular method NAT (mainly from academic and professional journals). During my research on Greek newspapers I found out that the vocabulary regarding risk and accidents in connection to blood transfusion did not help me retrieve many results, whereas the word ‘safety’ was more helpful in locating the relevant newspaper articles. This is very important historiographically because the identification of ‘opposites’ as used by the actors is crucial (risk – safety, vulnerability – safety, security).

18 Available at: http://www.eae.gr/haema/haema.htm [viewed 01/02/2010].
19 For example: Newspaper Kathimerini is available at: http://search.kathimerini.gr/ [viewed 12/04/2010], Newspaper Eleftherotypia is available at http://www.enet.gr/ [viewed 12/04/2010], Newspapers Ta Nea and Vima are available (full-text articles with subscription) at http://www.tovima.gr/default.asp?pid=58&la=1# [viewed 12/04/2010].
20 The archive collection contains the following newspapers: To Vima, Ελεύθερον Βήμα [Eleftheron Vima], Αθηναϊκά Νέα [Athinaika Nea], Ta Nea, and the magazine Οικονομικός Ταχυδρόμος [Oikonomikos Taxydromos], and includes issues from 1922 until 2006, http://www.tovima.gr/default.asp?pid=58&la=1# [viewed 12/04/2010].

4. Gender and Computing Technology Co-shaping (Hara Konsta)

My PhD project focuses on historical research on the public image of computing technology in Greece. The period I am interested in begins with the era of the image of a room-filling computer, which first appeared in the local press in the mid-1950s, and reaches to the establishment of the Internet (1954-2004). The main aim of my research is to record, examine and reach conclusions about historical changes regarding gender, space, work and educational issues due to the introduction of the electronic computer. I focus on advertisements and other images in Greek popular and technical journals and periodicals that portrayed the gender-computing technology relationship; I study these images as a “window” into popular culture. I am especially interested in images depicting the gender and computing technology relationship, which have contributed to women’s limited access to careers in informatics or to different roles of women in academia and R&D, and above all have affected their scientific productivity. I look at the way men and women are related in these images, especially through the mediation of computing machines or specific machine components. For the needs of this research, I have so far gone through both conventional, physical media archives and digitized archives like the Digitized Archive of Newspapers of the National Library of Greece and the DOL digitized archives. 21 In order to go through the financial magazine Οικονομικός Ταχυδρόμος [Oikonomikos Tachydromos] (from 1926 through 2004) and the newspaper Το Βήμα της Κυριακής [To Vima tis Kiriakis] (from 1945 through 2006) I used the DOL digitized archive. Both are published by the Lambrakis Press publishing company. 22

The starting point of my approach to the digitized archive was a rather “traditional” search under known categories, keywords or combinations of them. Popular keywords and terms that define gender and computing technology were used as queries on the search engine, like “πληροφορική τεχνολογία” (information AND technology), “γυναίκες και ηλεκτρονικοί υπολογιστές” (women AND computing), “ηλεκτρονικοί εγκέφαλοι” (electronic brains), “automation AND office” (αυτοματισμός γραφείου), “ηλεκτρονικοί υπολογιστές” (computers) etc. During my research on digitized archives I realized that the term “πληροφορική τεχνολογία” (information technology) did not help me retrieve many results, whereas the word “computer” was more helpful in retrieving relevant articles and advertisements in the press. 23 A major concern was to specify and narrow the query results. This is an option in the DOL system because a search can be specified, e.g., by requiring the words to be neighbouring in the text.

From a practical point of view, digitized archives proved to be an excellent tool for scholars because of the opportunity to retrieve and store the selected material and to juxtapose collection data, so that alternative and sometimes mutually contradictory interpretations of objects can appear as important. The retrieved material can also be used in a variety of ways: it can be organised, reworked based on its placement in a database suited to one’s research interests, and delivered in modular and multifarious ways. 24 The use of digitized archives enabled me as a user to link information in ways previously not possible. Even though I initially began browsing collection overviews with keyword searches, chronologies and searches under known categories, choosing to drive my own pathway through the collection enabled me to explore polysemic object-centered narratives. The clearly gendered term “διατρήτρια” (female punched-card operator), which means someone using a punch card machine, is an example of such a case, one that raises the historiographical issue of the “plurality of meanings”.

The experience has resulted in significant changes regarding my research questions as well as changes in my research practices. The most important change is related to the period under consideration. According to my initial research plan, I was to write a history of computing technology in Greece from 1980 onwards. Convenient access to magazines and newspapers through the electronic archive database opened a window on additional primary sources. For instance, a query with the keyword “υπολογιστής” (computer) retrieved results going back to 1954 and led me to redefine the time period of my research. Interestingly, when results from an initial search are retrieved and analyzed, new relevant keywords emerge. For example, within an article including the words “ηλεκτρονικός υπολογιστής” (electronic computer), I discovered a term unknown to me but widely used in the early 1960s to describe the new technology: “ηλεκτρονικός διερευνητής” (electronic explorer/examiner).

Because I focus on advertisements, which traditionally include images in addition to text, I found the availability of a digitized archive very helpful for identifying such images. More specifically, I have been able to identify images from a much broader basis of archival material than the one I originally planned to look at, because I could detect the presence of an image through the use of relevant keywords. However, when searching for images in the digitized archives, one of the most important problems was the lack of a search-engine tool able to detect and retrieve images under a particular category. Although the DOL archive allows access to large repositories of images, the free-text search it provides returns unsatisfactory or irrelevant results: a large number of advertisements do not include any of the known keywords, or they consist only of a photograph, a graphic or a sketch. Some navigation options enhanced my research while others caused confusion and left me with a massive amount of metadata to review. For example, the keyword “computer” (υπολογιστής), a widely used term for the technology of informatics after the 1980s, could retrieve irrelevant results, from the word “υπολογίζω” (to calculate) to “λογιστής” (accountant), due to peculiarities of the Greek language. This resulted in hundreds or thousands of results that had very little to do with the electronic computer itself. As far as enhancement is concerned, while my initial research plan was to study gender-and-technology-related advertisements and images, like those on women and computing, I discovered that the digitized archives can provide important information for the history of computing in Greece.

In conclusion, the digitized archive helped me search, retrieve and store a vast amount of data on gender issues in new ways, substantially revising the way important information could be documented and linked.

21 Available on http://www.nlg.gr/digitalnewspapers/ns/main.html
22 To Vima, To Vima tis Kiriakis and Oikonomikos Taxydromos are available at http://www.tovima.gr/default.asp?pid=58&la=1# [viewed 10-04-2010].
23 “Information Technology” (or its Greek translation “πληροφορική τεχνολογία” is a term)
24 F. Cameron and H. Robinson, “Digital Knowledgescapes: Cultural, Theoretical, Practical and Usage Issues Facing Museum Collection Databases in a Digital Epoch”, in F. Cameron and S. Kenderdine (eds.) Theorizing Digital Cultural Heritage – A Critical Discourse, MIT Press, (2007), 165-191.

“Appendix”: When a digitized historical archive is ‘absent’ (Theodore Lekkas)

The project and the archives

My PhD research focuses on the history of software in Greece in the period between 1980 and 2000. More specifically, I seek to study how software was used through localization processes. The historiographical emphasis on use challenges the assumption that the computer is a universal technology that can automatically serve all needs in a very effective way.
Therefore, I make specific references to the historical context of the development of software in Greece, to the role of the users in this country, and to relevant processes of localization-domestication of software technology in Greece. I am currently studying the role of software piracy in the history of the formation of the Greek software industry. The high piracy rates recorded in Greece make this case study all the more suggestive. As I am finding out, defining and identifying software piracy and, moreover, interpreting its influence on the development of software and the software industry is a much more complex affair than is usually assumed. My own research in the history of computing technology and software in Greece has suggested that “software piracy” activities go beyond the sphere of business and economics, because they reach into the political and the cultural sphere. To retrieve and interpret this reach, we need a historiographical emphasis on ‘technology-in-use’, that is, on the study of technology that includes the long-run end user. Studying software piracy in Greece may reveal various important tensions, including tensions between software producers and end users, between government and the software industry, and between Greek and foreign (mostly US and European) software vendors. Concluding this brief presentation of my current research, I may suggest that the study of software in a specific social context (e.g. in a country, in this case Greece) may prove crucial in attempting to understand the structure and development of computing technology and its relationship to society.

My primary sources are journalistic articles from the three longest-running Greek home computing periodicals, Ram, Computer for All and Pixel. These periodicals have been key intermediaries between the end user and the software producer. 25 In addition, I have used press releases from organizations involved in the software industry, including organizations set up to address the issue of software piracy. The list includes press releases by BSA Hellas. Finally, I have relied on surveys that reported on the state of the Greek IT industry in general and software piracy in particular.

Digitalized archives?

The first issue of the ‘Computer for All’ periodical 26 was published in January 1983 and, for at least 15 years, served as one of the main channels through which home computing users encountered the new computing technologies, learned about the Greek software houses and their products, and compared prices and services through extensive and specialized articles. The related ‘RAM’ magazine 27 first appeared in February 1988 and soon became the best-selling computer magazine in Greece. It was published by the most acknowledged mass media group in Greece, Lambrakis Press. 28 Having a more ‘professional’ profile, ‘RAM’ magazine was addressed to a wider reading audience, unlike the ‘Computer for All’ magazine, which attracted the more ‘hardcore’ PC users.
The third periodical that constitutes a valuable historical source is the ‘PIXEL’ magazine, which was first launched as a supplement of ‘Computer for All’ in October 1983, containing ‘listings’ of programs and games for users of home computers. The three magazines resulted from the need for a medium that could provide information, communication and problem-solving, a need that began to arise in the early 1980s, when the number of users of home and personal computers in Greece reached a critical size. Surprisingly, none of the three magazines is available in a digitized form, though most of them have their own web site. All the available issues can be found at the aforementioned libraries in hard copy. It is obvious that the absence of electronic material eliminates any search capability. The only magazine that can be accessed electronically is the PIXEL magazine, which has been scanned and uploaded on a portal by some 80s enthusiasts! This team of hobbyists created a database, called RetroDB, which can be found at http://www.retromaniax.gr/vb/forum.php. A sub-team is formed by former home computer users who want to ‘resurrect’ the atmosphere of the ‘80s and along with it the original sources, like magazines and TV shows. A complete collection of the PIXEL magazine, scanned and saved in JPEG format, is one of the results of this effort. The scanned material can be easily retrieved after a typical membership registration at the aforementioned portal. Of course, the problem persists, as you cannot search within the scanned pages and you have to navigate from one page to the other by unzipping them one by one. So the historian of technology has almost no capability of performing a dynamic search through keywords and specific terms in this digitized material.

25 Robbie Guerreiro-Wilson, Lars Heide, Matthias Kipping, Cecilia Pahlberg, Adrienne van den Bogaard, and Aristotle Tympas, ‘Information Systems and Technology in Organizations and Society: Review Essay’, in ‘Tensions of Europe’ Network First Plenary Conference Proceedings, Johan Schot et al. (editors), Budapest, Hungary, 2004 (CD-ROM).
26 http://www.cgomag.gr/
27 http://www.4pi.gr/ram/2010/04/
28 http://www.dol.gr/defaulte.asp


Second Paper Title: Appropriating SHOT’s classification in the context of the construction of the ‘history ontology’ of a second generation historical informatics research tool: Examples from research on computing and biotechnology

Authors: Akrivi Katifori, Nadzeya Kiyavitskaya, Costas Morfakis and Giannis Binietoglou

Affiliation of authors:
• Akrivi Katifori, Ph.D., Post-Doctoral Research Fellow, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Greece
• Nadzeya Kiyavitskaya, Post-Doctoral Researcher, Department of Information and Telecommunication Technologies, University of Trento, Italy
• Costas Morfakis, Doctoral Student, Graduate Program in the History of Science and Technology, National and Kapodistrian University of Athens and National Technical University of Athens, Greece
• Giannis Binietoglou, Doctoral Student, Division of History of Science and Technology, Department of Philosophy and History of Science, National and Kapodistrian University of Athens, Greece

Abstract

The paper reports on issues that surfaced while laboring to develop what informaticians have been calling a ‘History Ontology’, namely an inclusive set of historiographical considerations that could serve as the basis of the Papyrus design. Central to this paper is the report of an experiment with building this ontology around some formal SHOT (Society for the History of Technology) subject classifications and periodizations. For the design of a ‘Papyrus’ prototype, three areas of research on the recent history of technology were selected as especially important: the history of biotechnology and biomedical technology, the history of responses to climate change as affecting the course of energy technology in general and renewable energy technology in particular, and the history of ICT (computing-telecommunications and related technologies). The session will cover the development of a ‘History Ontology’ to accommodate the first and the third of these areas. The archival material taken into account in developing this ontology consisted of news items available from two leading European news agencies, DW and AFP.

1. Introduction

The ongoing digitization of media archives (textual and visual) over the last few years affects the way historians and other humanists and social scientists work with these archives. The massive availability of archival material in digital form and the difference between printed and digital archival material create new needs for research tools that go beyond simple search and retrieval (see preceding paper of this session by Tzokas et al.). “Second generation” digital information retrieval tools are currently being designed so as to provide an effective way to access archival material, in a manner tailored to the historians’ needs. Papyrus is an interdisciplinary research project of the European Union that attempts to support these needs. It brings together historians of technology and science who specialize in using media archives, journalists, and computer scientists who specialize in the computational technology involved in storing and assessing media archives. It attempts to investigate issues related to the use of media for historical research and to propose ways to support this research. Its main objective is to provide a dynamic digital library that will accept queries in terms relevant to a history researcher and then help this researcher to look for media content relevant to this query. The results are presented in a way that is useful and comprehensible to the user, at the same time providing context information on related historical concepts and historiographical issues. This process brings additional contextualization and ongoing refocusing of research within the reach of a historian. To test the Papyrus concept, the computer scientists and the historians who work together in this project decided to focus on the History of Technology and Science.
The available news archives, provided by our news provider partners, Deutsche Welle (DW) and Agence France Presse (AFP), contained news items related to various topics that could be of interest to a historian of science and technology. We selected two, taking into account their availability in the archives and their importance and social implications. These were renewable energy and wind power (1), and biotechnology and biomedical technology (2). In other words, Papyrus has been designed so as to accommodate the research needs of a historian of technology and science who works on topics of relevance to the history of wind power and renewable energy more generally or on the history of biotechnology and biomedical technology. However, these topics can be generalized so as to cover other History of Technology and Science domains. In this paper, we focus on the way the Papyrus team has handled the core of the Papyrus design, which is called the Papyrus ‘ontologies’. Central to this design was the decision to integrate into Papyrus the classification of historiographical issues contained in the formal subject index of the Society for the History of Technology (SHOT) and the History of Science Society (HSS).


The study of user needs – in this case those of historians of technology and science – has been the basis of our work and is described in section 2, followed by a brief outline of the Papyrus concept in section 3. The paper continues with section 4, which focuses on the Papyrus ontologies, the heart of Papyrus. It concludes with a presentation of the main functionality of the Papyrus platform in section 5, followed by the conclusions and our plans for future work in section 6.

2. The need for tool support in historical research in the digital age

Our group has in the past undertaken user studies related to the identification of historians’ needs and research methods [2], [3]. However, in the context of Papyrus we performed a more in-depth study of user needs and requirements with the help of the history partners of the Papyrus research consortium [1]. A dimension that was particularly emphasized during the user studies is the educational value of the Papyrus platform [4]. The combination of existing historical research results, in the form of essays and terminology, with archival material is considered of particular importance for the education of history students and the training of new researchers. All user needs identified have been the basis of the design decisions undertaken within Papyrus. The most challenging needs are presented in this section.

2.1 Research method

An important step towards understanding the user needs to be supported within Papyrus is the study of representative topics and questions for historical research. An example may be the following: “I am interested in information on the changes in biotechnology from the beginning of the 20th century until 1970”. History researchers proceed in specific steps when attempting to gather the material needed to investigate a specific topic like the aforementioned one. These steps include (in any order):

• Collecting relevant secondary material, which includes essays of other history researchers on related subjects. This material typically comes with a set of common vocabularies used by historians to refer to the topics covered by particular essays. This could contain historiographical issues, like “Controversies and Disputes”, “Discipline formation” or change in science, as well as general concepts like that of religion or politics.

• Collecting primary material, i.e., news archive content related to the research subject. This material usually comes with a different set of vocabularies, the one prominent during the time of the creation of the archive documents.

A very relevant issue for our project has been the way historians search and explore archival content. Whether with printed or digitized material, their preferred methods seem to be keyword searching and exploration. More specifically, the usual way to proceed when searching for relevant material is to break down the research topic into keywords and then try to find material related to these keywords.
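The keyword-driven workflow described above can be illustrated with a minimal sketch. It assumes a tiny in-memory collection (the article texts, identifiers and relevance judgements are invented for illustration) and contrasts a plain Boolean AND search with a search that requires the keywords to lie within a few tokens of each other. The function names and the `window` parameter are hypothetical; nothing here describes the Papyrus implementation or the search engine of any existing archive.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def matches_all(text, keywords):
    """Boolean AND: every keyword occurs somewhere in the text."""
    tokens = set(tokenize(text))
    return all(k.lower() in tokens for k in keywords)

def matches_near(text, first, second, window=1):
    """Proximity: the two keywords occur at most `window` tokens apart."""
    tokens = tokenize(text)
    pos_first = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_second = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= window for i in pos_first for j in pos_second)

def precision_recall(hits, relevant):
    """Share of hits that are relevant, and share of relevant items found."""
    found = set(hits) & set(relevant)
    precision = len(found) / len(hits) if hits else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented toy collection; a1 and a3 are hand-labelled as relevant.
articles = {
    "a1": "New screening improves blood safety in hospitals",
    "a2": "Blood donors gathered; road safety campaign launched elsewhere",
    "a3": "The safety of the national blood supply was debated",
}
relevant = {"a1", "a3"}

and_hits = [k for k, v in articles.items() if matches_all(v, ["blood", "safety"])]
near_hits = [k for k, v in articles.items() if matches_near(v, "blood", "safety")]
```

In this toy collection the AND query returns all three items, including the irrelevant one, whereas the proximity query returns a single relevant item and misses the other: narrowing the query removes noise but can also drop material the researcher wanted.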

Through our study it was evident that historians feel comfortable with keyword search and that it is their main method for retrieving content from an archive. However, most of them pointed out the deficiencies of existing keyword search tools for archives, related both to precision and recall. As a result, it is important for them to have an effective keyword search tool to support archival research. Another important requirement is the provision of efficient ways to browse vocabularies and catalogues related to their research.

2.2 The concepts in time and space

As emphasized by our history partners, a very important issue in their research is the change in concepts with the passage of time, which may include changes in their name or definition. An example is the modeling of the history of the term “biotechnology”, which has changed its meaning and names many times within the 20th century. Biotechnology as a concept and scientific discipline has progressed from food technology and fermentation to genetics and biomedical engineering [5]. Time is an important factor, as the assignment of time periods, in some cases without exact limits, is essential for describing this evolution of concepts. The issue of multilingualism in the context of a digital repository providing access to archival content of different countries and in different languages is particularly important for historical research. One dimension of the problem is related to the fact that a term may have been introduced at different points in time in different languages. For example, “biotechnology” has undergone different development paths in German-speaking and English-speaking countries [5]. A second, more complicated dimension of the problem is related to the fact that the same term, during the same time period, could mean different things in one language than in another. If we take for example the development of biotechnology in the German-speaking countries, there were two terms with different connotations used to refer to this one term in English, i.e., “biotechnik” (biology-based technology) and “biotechnologie” (microbiology and fermentation). In general, different terms can be used to describe the same concept under different contexts, essentially different working environment conditions, specified by the parameter values of time, place, language, dialect, domain, historiographical issues (i.e. social, cultural etc.), viewpoint, formality and diatype (i.e. a language variation, determined by its social purpose).

3. The Papyrus concept

In our attempt to accommodate the recorded user needs, we applied and extended in Papyrus existing semantic web technologies to build a platform that will provide advanced access to News Archives.
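Before turning to the platform itself, the requirement from Section 2.2 (one concept, several terms, each valid in a language and in a time period that may lack exact limits) can be made concrete with a minimal sketch. The `TermUsage` class, the `search_terms` function and, in particular, all year values are hypothetical: the years are placeholders chosen only to exercise the code, not historical findings, and this is not how Papyrus stores its terms.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TermUsage:
    concept: str
    term: str
    language: str
    start_year: Optional[int]  # None means no exact lower limit
    end_year: Optional[int]    # None means no exact upper limit
    sense: str

    def overlaps(self, from_year, to_year):
        """True if the usage period intersects the queried interval."""
        starts_in_time = self.start_year is None or self.start_year <= to_year
        ends_in_time = self.end_year is None or self.end_year >= from_year
        return starts_in_time and ends_in_time

# PLACEHOLDER years, invented for illustration only.
USAGES = [
    TermUsage("biotechnology", "biotechnik", "de", None, 1950,
              "biology-based technology"),
    TermUsage("biotechnology", "biotechnologie", "de", 1920, None,
              "microbiology and fermentation"),
    TermUsage("biotechnology", "biotechnology", "en", 1940, None,
              "English-language term"),
]

def search_terms(concept, from_year, to_year, language=None):
    """Terms to use as keywords for a concept in a given period and language."""
    return sorted({
        u.term for u in USAGES
        if u.concept == concept
        and (language is None or u.language == language)
        and u.overlaps(from_year, to_year)
    })
```

A query such as `search_terms("biotechnology", 1900, 1970, language="de")` would then yield both German terms, so that a researcher asking about "biotechnology" in that period is not limited to the English keyword.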


In this section, we give a high-level view of the functionality of the Papyrus platform and explain the interactions of its components. At the basis of the platform we may see the News Archives, in our case the Deutsche Welle and AFP ones. On top we may see two categorizations of terms, the News ontology and the History ontology. More details on these are available in Section 4.

Figure 1 - The Papyrus Platform

News ontology terms are used to annotate the news content whereas history ontology terms model knowledge interesting to the historian. The mappings between the ontologies express correspondences of terms between the two domains. Mappings can be trivial ones (e.g., capturing an exact correspondence between two entities), as well as more complicated ones, according to the way history researchers use news-related keywords to retrieve information on specific historical topics. The Papyrus platform offers a specialized browser which, together with the keyword search and the mapping mechanisms of the platform, enables users to navigate from history ontology concepts to news ontology concepts and achieve effective access to the primary material in the archives. Section 5 discusses the details of the browser and presents keyword queries with a temporal dimension in the context of Papyrus. The Papyrus platform also offers a number of web tools to facilitate ontology editing, creation of mappings between the two domains and management of news content and analysis results.
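The role of the mappings can be sketched in a few lines, under strong simplifying assumptions: each history ontology concept maps to a flat set of news ontology terms, and each news item carries a set of annotation terms. The dictionaries, item identifiers and news terms below are invented for illustration, and the function names are hypothetical; the actual Papyrus mappings can be considerably richer than set membership.

```python
# History ontology concept -> News ontology terms (news terms are invented).
MAPPINGS = {
    "safety": {"safety"},  # trivial mapping: exact correspondence
    "controversies and disputes": {"protest", "lawsuit", "parliamentary debate"},
}

# News items annotated with News ontology terms (invented examples).
NEWS_ITEMS = {
    "item-1": {"protest", "wind farm"},
    "item-2": {"safety", "blood screening"},
    "item-3": {"football"},
}

def news_terms_for(history_concept):
    """News ontology terms that a history concept is mapped to."""
    return MAPPINGS.get(history_concept, set())

def retrieve(history_concept):
    """Identifiers of news items annotated with any mapped news term."""
    wanted = news_terms_for(history_concept)
    return sorted(item for item, tags in NEWS_ITEMS.items() if tags & wanted)
```

With these toy data, `retrieve("controversies and disputes")` reaches an item annotated only with "protest", which a historian searching for the historiographical term itself would not have found by keyword.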


The following sections focus on the end user aspect of Papyrus and present the two ontologies as well as the query and search mechanisms.

4. The Two Ontologies

The two ontologies are the heart of the Papyrus platform. The term “ontology” in computer science refers to a set of concepts that model a specific domain of interest [12]. These concepts form a categorization, have properties and are related through various relations. Ontologies offer a level of formalism that is readable by both humans and computers. As a result, the user is offered a transparent overview of the domain described by an ontology, and the ontology can also be used by a computer for more intelligent information retrieval operations. Taking into account these advantages, we have been using ontologies as tools to model both our application domains, History and News. The following sections present the process of creating these ontologies in more detail.

4.1 The History ontology

In our effort to “formalize” the History of Science and Technology domain in an ontology, our historian partners suggested two very important societies in the field. One is the History of Science Society 1 with its journal “ISIS” and the other is the Society for the History of Technology 2 with its journal “Technology and Culture”. These were used as sources with the following objectives:

• Periodization. In order to represent time properly within the ontology, we needed to have an in-depth understanding of how historians use time periods and chronological divisions.

• Classification. In order to create a rich as well as structured ontology, we needed to study the formal classification used in the subject index of the journals of the two selected societies.

After collecting this information, our historian partners organized it into a list of time periods of interest. Furthermore, the historians combined the two subject classifications, selected a set of inclusive subjects, and clustered them in the following six sets: 1. change in science/technology, 2. institutions, 3. research and development, 4. controversies and disputes, 5. popularization, and 6. ethics.

We call such subject clusters, systematically used for historic research in the area of science and technology, historiographical issues. Table 1 includes the full list of the subject clusters and the subjects that they contain.

1. Change in science/technology: change in science, change in technology, environmental history, discipline formation, discovery (in science), artifacts, experiments and experimentation, academic disciplines, scientific communities, professions and professionalization

2. Institutions: universities and colleges, societies, institutions, academies, (international) congresses, conferences, and meetings, research institutes, research schools, research stations, laboratories, prizes, awards, Nobel Prizes

3. Research and development: technological innovation, impact of technology, technology assessment, public policy, government sponsored science, patents, big science, science and industry, technology and industry, entrepreneurs and entrepreneurship

4. Controversies and disputes: determinism, progress (ideas of), revolutions in science, globalization, modernization, international cooperation, futurism, utopias, authority of science, technocracy, controversies and disputes, political activists, non-governmental organizations, risk assessment, biological diversity, safety, limits of science

5. Popularization: popular culture, rhetoric, metaphors and analogies, public opinion, public understanding of science, expert testimony

6. Ethics: science and ethics, technology and ethics, privacy, private life, interprofessional relations

Table 1 - General History Ontology Subjects

Then, these clusters were further specified into an extended list of concept candidates and arranged into a set of concepts, instances and relationships to be inserted in the ontology. To model the historian's world, we started from the CIDOC reference model [9].
CIDOC is a model that provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. For example, the model describes how an artifact is produced, how its physical custody is transferred, and how various attributes are assigned to it. In order to preserve the consistency of CIDOC, we decided to keep even the concepts that are not relevant to our domains and to modify the model only by augmenting it with additional knowledge that is important for the needs of historians, e.g., the historiographical issues specified in Table 1. For more details on the History ontology, please refer to [6].

4.2 The News ontology

The Papyrus News ontology is a tool to be used by the news agencies for news item categorization and retrieval. It provides a rich set of metadata to be used for the annotation of news content, either manual or automatic. News agencies may use this metadata for the classification and retrieval of the news content. In order for the News ontology to be useful for exchanging news content between news agencies, it conforms to existing news ontology standards, the most prominent being NewsML-G2, designed by the IPTC (note 3). The first version of this standard, NewsML, was designed by the International Press Telecommunications Council (IPTC) in 2000 to provide a media-independent, structural framework for multi-media news. At the heart of NewsML is the concept of the news item, which can contain various media (text, photos, graphics, or video) together with meta-information that enables the recipient to understand the relationship between the components and the role of each component. An updated version of this standard, named NewsML-G2 [7], was released in 2008. It is a member of a family of complementary IPTC news exchange format standards, collectively known as G2-Standards, which also offer a standard representation of news events and another for sports results and statistics. This model is made of two parts: a structural model representing news items and news packages, and a basic model of semantic concepts useful for the annotation of general news, e.g., people, organizations and locations. The IPTC Subject News Codes are sets of topics (also known as topical subjects) to be assigned as metadata values to news objects such as text, photographs, graphics, audio and video assets.

However, the existing NewsML-G2 constructs were not sufficient for the needs of Papyrus. The historian approaches the content of a news agency item from a different angle than the journalist who authored this item. To this end, a combination of content analysis techniques and manual modeling that takes user needs into account was undertaken, resulting in a richer News Ontology, where named entities have been introduced, as well as important news item concepts.
In this model, each news item is identified by its URI and can have a list of related topics that may contain themes (those established by the IPTC categorization [10], to be respected by the news agencies when annotating their news content, as well as domain-dependent ones) and terms, such as named entities (Person, Organization, Location), concepts (other objects or notions), or slugs, i.e., terms defined as relevant to the IPTC subjects. In turn, each term can be defined by a set of keywords. This work has been undertaken for the two main domains that Papyrus focuses on:

• Biotechnology and biomedical technology

• Renewable energy with focus on wind power.
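The news item model described above (a URI, a list of themes, and terms that are in turn defined by keywords) can be pictured with a small data structure. The following is only a minimal sketch under our own assumptions: the class names, the `annotate` helper, the example URI, the theme "energy" and the vocabulary entries are hypothetical and are not taken from the Papyrus News ontology itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    """A term attached to a news item (hypothetical structure)."""
    label: str
    kind: str  # "person", "organization", "location", "concept" or "slug"
    keywords: List[str] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        # A term matches if its label or any defining keyword occurs in the text.
        haystack = text.lower()
        return any(k.lower() in haystack for k in [self.label, *self.keywords])


@dataclass
class NewsItem:
    uri: str  # each news item is identified by its URI
    themes: List[str] = field(default_factory=list)  # IPTC or domain-dependent themes
    terms: List[Term] = field(default_factory=list)


def annotate(uri: str, text: str, vocabulary: List[Term], themes: List[str]) -> NewsItem:
    """Attach every vocabulary term whose label or keywords occur in the text."""
    matched = [t for t in vocabulary if t.matches(text)]
    return NewsItem(uri=uri, themes=list(themes), terms=matched)


# Illustrative vocabulary for the wind power domain (invented entries).
VOCABULARY = [
    Term("wind farm", "concept", ["wind park", "wind turbines"]),
    Term("Denmark", "location"),
]

item = annotate(
    "urn:example:item-1",
    "New wind turbines were installed off the coast of Denmark.",
    VOCABULARY,
    themes=["energy"],
)
print([t.label for t in item.terms])
```

In this toy setting, simple substring matching stands in for the manual or automatic annotation mentioned in the text; a real annotator would be considerably more elaborate.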

In order to populate our news ontology with relevant knowledge, we combined several efforts. First, we manually generated a classified list of named entities, using as a basis the preliminary list of keywords mined by a statistical technique from a corpus of documents on the topics of interest. Second, a set of relevant news categories from the IPTC Subject News Codes was selected. A further extension was the addition of a set of more specific domain-driven themes, with their related terms and keywords, so that the news content from our domains of interest can be accurately classified. The result has been a model for a news ontology that conforms to the aforementioned standards and is extended to include concepts relevant to the two selected domains.

4.3 Bridging History and News

As already mentioned, the History and News ontologies model the domains Papyrus focuses on from two different perspectives. The perspective of news professionals attempts to record current events, whereas the historical perspective examines their importance and interprets them within a greater socio-cultural and temporal context. One of the main challenges that our team had to face was how to bridge these two different perspectives and provide a framework for creating correspondences between concepts of the two ontologies. These correspondences, or mappings in the language of computer scientists, enable the users of the Papyrus platform to move from the history ontology to the news one and retrieve news content.

To illustrate mappings between the History and News ontologies, we may consider the following example: the news archives may use the term “Old World”, which should be mapped to the parts of the world known to Europeans before the 15th century. We may formally define that the entity “Old World” in the News ontology corresponds to the entities “Europe”, “Asia”, and “Africa” of the History ontology. Moreover, we should restrict this mapping to be valid only after the 15th century. These kinds of dependencies must be described in a formal way, so that they can then be used in the query execution scenarios. Query execution may result in different answer sets depending on the constraints taken into account from both ontologies. The discovery and population of these mappings has been one of the most challenging tasks our team had to face.
Existing ontology matching tools turned out to be ineffective in addressing this challenge, because they were developed with a different assumption in mind, i.e., for the identification of similar constructs in ontologies that essentially describe the same application domain. Thus, automated solutions can be used only to identify the most trivial cases, where, for instance, we may want to record that cloning in the History ontology may be mapped to keywords like “clones” or “pet cloning” in the News ontology. However, there are more complex cases of mappings, where historiographical issues may also be involved. For example, when we are interested in the “public opinion on stem cells”, the combination of the concepts “stem cell” and “public opinion” should be mapped to combinations of keywords which, among others, could contain keywords related to the Catholic Church, which has affected public opinion on this issue.
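The three kinds of mapping discussed in this section (a trivial keyword mapping, a mapping restricted in time, and a mapping that holds only for a combination of history concepts) can be sketched as follows. This is an illustrative assumption of ours, not the formalism used by the Papyrus mapping tool: the `Mapping` class, the `require_all` and `valid_from` fields, the year 1501 standing in for “after the 15th century”, and the keyword “stem cell research” are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Mapping:
    history: FrozenSet[str]            # history ontology concepts
    news: FrozenSet[str]               # news ontology terms / keywords
    require_all: bool = False          # True: every history concept must be selected
    valid_from: Optional[int] = None   # first year in which the mapping holds

    def applies(self, selected: Set[str], year: Optional[int]) -> bool:
        # Temporal restriction: ignore the mapping for years before valid_from.
        if self.valid_from is not None and year is not None and year < self.valid_from:
            return False
        if self.require_all:
            return self.history <= selected
        return bool(self.history & selected)


MAPPINGS = [
    # Trivial case: one history concept, a few news keywords.
    Mapping(frozenset({"cloning"}), frozenset({"clones", "pet cloning"})),
    # Temporal case: "Old World" corresponds to Europe, Asia and Africa,
    # and is valid only after the 15th century (assumed here as year >= 1501).
    Mapping(frozenset({"Europe", "Asia", "Africa"}), frozenset({"Old World"}),
            valid_from=1501),
    # Combination case: only the pair "stem cell" + "public opinion" maps to
    # keywords that include the Catholic Church (keywords are illustrative).
    Mapping(frozenset({"stem cell", "public opinion"}),
            frozenset({"stem cell research", "Catholic Church"}),
            require_all=True),
]


def news_terms_for(selected: Set[str], year: Optional[int] = None) -> Set[str]:
    """Collect the news terms reachable from the selected history concepts."""
    result: Set[str] = set()
    for mapping in MAPPINGS:
        if mapping.applies(selected, year):
            result |= mapping.news
    return result


print(sorted(news_terms_for({"stem cell", "public opinion"})))
```

The point of the sketch is that the same selection of history concepts can produce different answer sets depending on which constraints (here, the year) are taken into account.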


Complex mappings of this kind are harder to record automatically, and the help of our experts, the historians, is essential here. Papyrus offers a tool called TrenDS [8], which allows the historians to explore the two ontologies and manually create mappings between them. It is evident that Papyrus cannot initially accommodate all possible mappings. As new research topics appear and the History ontology is extended, there is a need for constant feedback from the users, the history experts, in the form of proposed mappings for the system. We envision Papyrus as a dynamic system that may even accommodate, through different sets of mappings, different points of view on the same topic.

5. Browsing and searching history with Papyrus

Having modeled our two domains, History and News, and defined the way to bridge them, the next step was to create tools that would allow the historians to access and explore this rich material. Taking into account the need for both searching and browsing the information offered through the system, we designed and implemented two different ways to access the same material: a browsing tool, called the Papyrus Browser, and an advanced search mechanism, Cross-Discipline Search. These are presented in the following sections.

5.1 The Papyrus Browser

The Papyrus Browser is a web-based tool that allows the exploration of news content, as expressed by the news ontology, from the History point of view. It is one of the few web-based ontology browsers available and the only one that is designed directly for the end user and not for the ontology designer. It is a specialized tool which combines two different domain ontologies as well as the content they describe. As defined early on, the main objective of the browser was to give historians the possibility to have both ontologies available in the same window and to reach the news content through them.
The Papyrus Browser, combined with the keyword search functionality over the history ontology, is envisioned to be the tool for researching effortlessly both primary (news ontology and content) and secondary (history ontology) material. The Browser aims to provide different views of the two ontologies, available through different tabs. The “Simple Browser” tab offers the possibility to explore the structure and instances of both ontologies, whereas the historiographical issues view (“Papyrus Browser” tab) attempts to provide step-by-step access to the material. Based on the historical research method identified in the requirements phase (Section 2), the following steps have been designed and implemented:

1. Select historiographical issues and domains of interest.
2. Select history ontology concepts and instances that are of interest.
3. View news ontology concepts and instances related (via mappings) to the ones selected in steps 1 and 2, and select the ones that are of interest.
4. View news items, the actual multimedia news content, related (via analysis) to the concepts selected in step 3.

The historian is presented with all this information in the same browser window. Textual or multimedia information, such as concept and instance descriptions or the news items themselves, is presented in a separate pop-up window. A partial view of the “Papyrus Browser” tab is presented in Figure 2. This functionality of the historiographical issues view has been implemented with 5 panels:

• Historiographical Issues. This panel presents the hierarchy of the History ontology historiographical issues concepts [6].

• History Domains. These are the domains that have been selected for the context of Papyrus, i.e., Biotechnology and Renewable Energy.

• History Ontology. This panel contains a list of related history ontology concepts and instances according to the selected domain(s).

• News Ontology. This panel contains a list of related news ontology concepts and/or instances according to a selected history ontology concept and/or instance.

• News Items. This panel lists the news items that are related to the selected news ontology concepts.
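The way a selection in one panel determines the content of the next one can be pictured as a chain of lookups. The sketch below is only an illustration under our own assumptions: the dictionaries, the function names and the item identifiers are invented, and the actual browser works on the two ontologies and the analysis results rather than on hard-coded tables.

```python
# Toy data; every entry below is illustrative only.
HISTORY_CONCEPTS = {
    # concept: (domain, historiographical issues it is filed under)
    "Biotechnology": ("Biotechnology", {"Change in Science"}),
    "Stem cell": ("Biotechnology", {"Public opinion"}),
    "Wind farm": ("Renewable Energy", {"Change in Technology"}),
}

# History concept -> news ontology concepts (stored mappings).
MAPPINGS = {
    "Stem cell": ["Roman Catholic Church", "stem cell"],
    "Wind farm": ["wind turbine"],
}

# News ontology concept -> news items (result of content analysis).
NEWS_INDEX = {
    "Roman Catholic Church": ["item-17"],
    "stem cell": ["item-17", "item-42"],
    "wind turbine": ["item-8"],
}


def history_panel(issues, domains):
    """Steps 1-2: history concepts for the selected issues and domains."""
    return sorted(
        concept
        for concept, (domain, concept_issues) in HISTORY_CONCEPTS.items()
        if domain in domains and concept_issues & set(issues)
    )


def news_ontology_panel(history_selection):
    """Step 3: news concepts reachable via mappings from the selection."""
    related = set()
    for concept in history_selection:
        related.update(MAPPINGS.get(concept, []))
    return sorted(related)


def news_items_panel(news_selection):
    """Step 4: news items related to the selected news concepts."""
    items = set()
    for concept in news_selection:
        items.update(NEWS_INDEX.get(concept, []))
    return sorted(items)


concepts = history_panel({"Public opinion"}, {"Biotechnology"})
news_concepts = news_ontology_panel(concepts)
print(concepts, news_concepts, news_items_panel(news_concepts))
```

Each function corresponds loosely to one of the lower panels, and the output of one call is the input of the next, which mirrors the step-by-step access the view is meant to provide.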

For example, to support research on change in the domain of Biotechnology, the user could use the Papyrus Browser to select the appropriate historiographical issue (Change in Science) and the domain (Biotechnology), as shown in Figure 2. The user can then see related concepts and instances (Figure 2, right) and, by clicking on them, view their properties (Figure 3).


Figure 2 - Part of the historiographical issues view where the user may select a domain (Biotechnology), a historiographical issue (Change in Science) and a History ontology concept/instance (Biotechnology)

Figure 3 - Properties of the scientific discipline of Biotechnology

Another example is presented in Figure 4. The user is interested in “Public opinion on stem-cells”. He has selected the domain “Biotechnology”, the historiographical issue “Public opinion” and the concept “Stem cell”. He may view information on the concept stem-cell as well as related concepts in the news ontology. The Roman Catholic Church appears as a related news ontology concept because a historian has inserted in the system a mapping that relates this concept to public opinion on stem-cell issues. As a result, related news items appear at the lower right of the screen.


Figure 4 - Overview of the Papyrus Browser

5.2 Cross-discipline search

As already explained in previous sections, the Papyrus project attempts to bridge the news and the history domains. The historian is presented with the screen shown in Figure 5. In this screen there are three options available to the user for searching:

1. Keyword Search is the simplest option, which looks for the given keywords only in the text of the news items.
2. Semantic Search looks in the text but also in the news item metadata produced by the content analysis modules, which relate the news content to the news ontology and to other external sources of metadata, such as Wikipedia.
3. Cross-discipline Search first identifies entities in the History ontology that are related to the given keywords and then, through the mappings, retrieves news content.

The most innovative of the three Papyrus searches is the last one, which allows the historian to start from the History ontology and his/her topics of interest and then retrieve related news items. To demonstrate this process, suppose that a historian would like to find news about artifacts related to genetically modified organisms.

1. She types the keywords “artifacts genetically modified organisms” in the search box of the screen of Figure 5.


Figure 5 - Papyrus main search screen. The user types the keyword “artifacts genetically modified organism”.

2. When the user clicks “Search”, a list of the history concepts related to the given keywords appears on the left (Figure 6).

Figure 6 - Results of the search on the History Ontology.

As a final step, the user clicks on the concept “Dolly the Sheep” and then clicks the button labeled “Search” to retrieve related material from the News Ontology and the news content using the stored mappings. Selected concepts appear in the “Selected” list on the right of the screen shown in Figure 6.

Figure 7 - News item results
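The difference between the three search options can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the Papyrus implementation: the news items, their metadata, the keyword descriptions attached to history concepts, the “Wind turbine” concept and the mapping table are all invented for the purpose of the example.

```python
import re

# Toy data; all texts, metadata and mappings are invented for illustration.
NEWS_ITEMS = {
    "item-1": {"text": "Scientists clone a sheep named Dolly.",
               "metadata": {"cloning", "Dolly the Sheep"}},
    "item-2": {"text": "Farmers protest against genetically modified crops.",
               "metadata": {"genetically modified organisms"}},
    "item-3": {"text": "New wind farm opens.",
               "metadata": {"wind power"}},
}
# History concept -> descriptive keywords used to match a user query.
HISTORY_ONTOLOGY = {
    "Dolly the Sheep": "artifacts genetically modified organisms cloning",
    "Wind turbine": "artifacts wind power",
}
# History concept -> news ontology terms (stored mappings).
MAPPINGS = {
    "Dolly the Sheep": {"cloning", "Dolly the Sheep"},
    "Wind turbine": {"wind power"},
}


def tokens(s):
    return set(re.findall(r"\w+", s.lower()))


def keyword_search(query):
    """Option 1: look for the keywords only in the text of the news items."""
    q = tokens(query)
    return sorted(i for i, it in NEWS_ITEMS.items() if q <= tokens(it["text"]))


def semantic_search(query):
    """Option 2: look in the text and in the analysis metadata."""
    q = tokens(query)
    hits = []
    for i, it in NEWS_ITEMS.items():
        searchable = tokens(it["text"]) | tokens(" ".join(it["metadata"]))
        if q <= searchable:
            hits.append(i)
    return sorted(hits)


def history_concepts_for(query):
    """Option 3, first half: rank history concepts by keyword overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(desc)), c) for c, desc in HISTORY_ONTOLOGY.items()]
    return [c for n, c in sorted(scored, key=lambda p: (-p[0], p[1])) if n > 0]


def cross_discipline_search(selected_concepts):
    """Option 3, second half: follow the mappings to the news content."""
    news_terms = set()
    for concept in selected_concepts:
        news_terms |= MAPPINGS.get(concept, set())
    return sorted(i for i, it in NEWS_ITEMS.items() if it["metadata"] & news_terms)


candidates = history_concepts_for("artifacts genetically modified organisms")
print(candidates, cross_discipline_search(["Dolly the Sheep"]))
```

In this toy setting the query “cloning” finds nothing through keyword search, because the word does not occur in any item text, whereas the metadata-aware and mapping-based routes both reach the item about Dolly.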

6. The Challenge Ahead

In this paper we presented the progress made so far with the Papyrus project, a second generation digital tool to support historical research. Our research is multi-disciplinary and embraces several research issues at different levels: modeling different domains, establishing correspondences between heterogeneous models, knowledge discovery and querying, development of innovative environments to make our tools useful and accessible to our target users, and other issues. For most of these, we have already provided solutions. However, as the project reaches its final stages, preliminary evaluation has indicated the main weaknesses of the tool on which to focus. During this evaluation, the historians who tried the tool offered very useful insights and comments on how to improve it and bring it one step closer to supporting the needs of historical research. Most of the issues recorded concerned the user interface and the way the information is presented and visualized. Besides this ongoing corrective work, there are some truly challenging issues ahead. One of them is to provide proper support for the multilingual nature of the content of European archives. Multilingualism has been modeled in the History ontology [11]. Our next step is to provide effective, user-friendly support for the specification and presentation of multilingualism in the user interface.


Our future plans also include:

• Further investigation of the historical research method, so as to construct a set of ‘best practice’ patterns and recognize information pertinent to different historiographical issues;

• Further automation of the various stages, in order to make Papyrus scalable to bigger document sets and richer models, and easily adaptable to different domains.

We believe that the outcomes of Papyrus will not only provide a set of innovative solutions, but also allow us to discover some new, inspiring challenges as a result of this fruitful communication between computer scientists and history researchers.

Acknowledgements

This research was funded by the PAPYRUS project (ICT-215874). We would also like to thank the members of the Tensions of Europe (TOE) community who have volunteered to evaluate Papyrus components and the Papyrus consortium as a whole during various stages.


Notes

1. http://www.hssonline.org
2. http://www.historyoftechnology.org/
3. http://www.iptc.org

References

[1] Papyrus Deliverable D2.2 – User requirements specification, http://www.ictpapyrus.eu/files/Papyrus-D2.2-v02.1.pdf

[2] Katifori, A., Torou, E., Vassilakis, C., Halatsis, C., “Supporting Research in Historical Archives: Historical Information Visualization and Modeling Requirements”, In IV Proceedings of the 2008 12th International Conference Information Visualisation, pp. 32-37, 2008.

[3] Torou, E., Katifori, A., Vassilakis, C., Lepouras, G. and Halatsis, C., “Capturing the historical research methodology: an experimental approach”, In Proceedings of International Conference of Education, Research and Innovation (ICERI 2009), Madrid, November 16-18, 2009.

[4] Katifori, A., Tympas, A., and Mergoupi-Savaidou E., “Making History Courses Relevant and Attractive to Engineering and Science Majors by Bringing Archival Research Within Their Reach: The PAPYRUS Initiative”, in Proceedings of the International Technology, Education and Development Conference (INTED), 2009.

[5] Bud, R., “Biotechnology in the Twentieth Century”, Social Studies of Science 21, no. 3 (1991): 415-457.

[6] Papyrus Deliverable D3.1 – Ontologies for news and historical content, http://www.ict-papyrus.eu/files/Papyrus-D3.2-v1.93.pdf

[7] NewsML-G2, http://www.iptc.org/cms/site/index.html?channel=CH0111

[8] Papyrus Deliverable D3.3 – Prototype for Ontology Matching and Mapping, https://hestia.atc.gr/papyrus/uploads/PAPYRUSInfo/FinalVersions/Manual_for_the_Mapper.pdf; and Annex https://hestia.atc.gr/papyrus/uploads/PAPYRUSInfo/FinalVersions/D3.3_Annex.pdf

[9] The CIDOC Conceptual Reference Model: http://cidoc.ics.forth.gr

[10] IPTC news codes: http://www.iptc.org/NewsCodes/index.php

[11] Tsinaraki, C., Velegrakis, Y., Kiyavitskaya, N., Mylopoulos, J., “What's in a Name? Polysemy in Conceptual Models”. Submitted to ER 2010.

[12] Noy, N. F., McGuinness D. L., “Ontology Development 101: A Guide to Creating Your First Ontology”, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001.


Third Paper

Title: Integrating research into video and audio digitized archives into textual research: Examples from research on renewable energy

Authors: Krishna Chandramouli, Roberta Turra, Giorgio Pedrazzi, Foteini Tsaglioti, Vaso Aggelopoulou and Aristotle Tympas

Affiliation of authors:

• Krishna Chandramouli, Ph.D., Research Assistant, Multimedia and Vision Research Group, Department of Electronic Engineering, Queen Mary University of London, UK

• Roberta Turra, Research Coordinator, Cineca, Italy

• Giorgio Pedrazzi, Ph.D., Cineca, Italy

• Foteini Tsaglioti, Doctoral Student, Graduate Program in the History of Science and Technology, National and Kapodistrian University of Athens and National Technical University of Athens, and Scholar-in-Residence, Deutsches Museum, Greece and Germany

• Vaso Aggelopoulou, Doctoral Student, Division of History of Science and Technology, Department of Philosophy and History of Science, National and Kapodistrian University of Athens, Greece

• Aristotle Tympas, Assistant Professor, Division of History of Science and Technology, Department of Philosophy and History of Science, National and Kapodistrian University of Athens, Greece

Abstract

This paper reports on the design and development of an interdisciplinary search platform, termed ‘Papyrus’, that supports research on digitized video and audio archives of media news agencies. Historical research on renewable energy, and wind power in particular, is chosen as the area of reference for building the prototype components. The key contribution of this paper is an interdisciplinary framework that links the commonly unexplored audiovisual medium to historiographical issues, so as to make the researcher aware of the availability of relevant material. Over the course of the presentation, we will present examples from our attempt to develop semantic analysis techniques for multimodal content that link video, images and audio from news agency archival items to historiographical considerations. The data collection for the experimental evaluation has been obtained from the Deutsche Welle and Agence France Presse broadcast archives, which contain video and audio archival material as a primary source.

1. Introduction

The ongoing digitization of media and other textual archives is already changing research in the history of technology and science [Tzokas et al., this session]. At the same time, research in the history of technology and science (and research throughout the humanities and the social sciences, as well as research by journalists and other professionals and amateurs) is also being changed by the availability of non-textual digitized media archives, namely video and audio archives. Addressing the challenges of developing an integrated framework for the explicit exploitation of ever increasing audiovisual data, we present an overview of the Papyrus interdisciplinary search engine. In order to demonstrate the advantages of exploiting audiovisual information for investigating historiographical issues, two domains of interest were selected, namely the history of wind power and biotechnology (for the choice of the two Papyrus domains, refer to Katifori et al., this session). However, in this paper we focus only on the wind power domain. A large amount of the content received from Deutsche Welle (DW) and Agence France-Presse (AFP) contained audiovisual material. Therefore, in this paper we present an integrated Papyrus framework for extracting the implicit knowledge embedded in these multimedia items. A detailed discussion of the construction of the Papyrus Ontology is presented in [Katifori et al., this session].

To set the stage for introducing these contributions, we briefly introduce the historiographical framework that was taken into account during the development of the multimedia framework. This framework is formed by combining general historiographical suggestions on how to properly study technological change (note 1), specific historiographical suggestions from the available historiography of wind power (note 2), and suggestions concerning the increasing importance of audiovisual archives for the study of the history of recent technology and science (note 3). Central to this historiographical framework has been the hypothesis that audiovisual archives are not linear extensions of textual archives. There is historiographically important information that we can get only through audio and/or video sources. For example, videos on wind power can offer unique insight into the rhetorical/narrative strategies that have been used to discuss the advantages and disadvantages of wind power. In the multimedia material that we used to develop the Papyrus prototype, we found several such strategies. The list includes the simultaneous display (within the same picture) of old windmills and new wind farms, so as to convey a sense of continuity between wind farms and windmills. It also includes the simultaneous display of wind farms and conventional energy generation installations, especially generation plants that generate visible amounts of smoke. In this case, the visual strategy aims at contrasting the two (wind power and conventional energy installations). In most cases, these strategies do not make it into the text. To research them, a historian of wind power has to study video and/or audio material. Given the availability of a great amount of video and audio, we have focused on how Papyrus could help this researcher to access it. The contributions outlined in the following sections of this paper are focused on the same example, namely Papyrus-assisted research on the relationship between windmills and wind farms as displayed in audio and video resources.

In Figure 1, an integrated framework for audiovisual media processing is presented. The video item uploaded to the Papyrus repository is analysed with visual, audio and textual components. The visual analysis tools include a shot boundary detection module and a highlight extraction module, followed by low-level and high-level feature extraction. The audio stream extracted from the video is processed by further analysis components, namely speaker diarisation followed by a speech recognition module. The textual transcript produced by the automatic speech recognition (ASR) module is further analysed with named entity recognition (NER). Additional processing components available for textual analysis include a concept extractor, which can extract concepts such as “hill”. The metadata generated by the analysis components are stored in the Papyrus metadata model. Based on the user query, the metadata model is searched and the corresponding media items are retrieved according to the user requirements.

Figure 1 - A generic framework for multimodal component

2. The video component analysis

In addition to the above modules, continuous research was carried out on developing a temporal segmentation algorithm using the MPEG-7 Colour Layout Descriptor (CLD) [1]. The MPEG-7 CLD is a compact and resolution-invariant representation of colour, specifically developed for high-speed image retrieval. The computational efficiency of the descriptor has also often been exploited for the temporal segmentation of video. In general, the descriptor is designed to capture the spatial distribution of colour in an image or an arbitrarily shaped region. The spatial distribution of colour constitutes an effective descriptor for sketch-based image retrieval, content filtering using image indexing, and visualization. The functionality of this descriptor can also be achieved using a combination of a grid structure descriptor and grid-wise dominant colours. However, such a combination would require a relatively large number of bits, and matching would be more complex and expensive. The CLD uses representative colours on an 8 × 8 grid, followed by a Discrete Cosine Transform (DCT) and encoding of the resulting coefficients. The feature extraction process consists of two parts: grid-based representative colour selection, and the DCT with quantization. The DC values are quantized to 6 bits and the remaining coefficients to 5 bits each. The CLD has been found to be quite effective in image retrieval, and it also compares favourably with a grid-based dominant colour approach in which the image is partitioned and the dominant colours of the partitions are used to represent the layout. For matching between two CLDs, $(DY, DCr, DCb)$ and $(DY', DCr', DCb')$, the L2 measure is used. For detecting visually coherent scenes, a thresholding scheme is applied.

From the analysis of the Papyrus videos, the challenges of developing a temporal segmentation module include the following:

• to account for the transition of the objects in a scene

• to account for fades and dissolves in a shot

• to account for shot and scene changes

To address these challenges, and in particular to detect fade and dissolve transitions, a time-delay version of the module has been developed. The time-delay module accounts for the slow change in the visual characteristics of the shot. In addition, the module also considers the transition (or camera pan across a view) between shots to extract the shot boundaries.

3. The Keyframe Extraction

For the extraction of keyframes from the video, a measure of visual dissimilarity is derived by implementing a supervised classifier. The visual dissimilarity between frames $f_A$ and $f_B$ is generated by training the classifier with the MPEG-7 feature set of frame $f_A$; the frames that follow $f_A$ along the temporal line, $f_{A_i}$ with $i \in \{0, 1, \ldots, N\}$, where $N$ is the total number of frames in the video, are presented to the classifier as a test set. The classifier output provides a measure of dissimilarity between each frame $f_{A_i}$ and $f_A$. Hence, the algorithm is considered to be a supervised classification: frames $f_{A_i}$ labelled as positive (or ‘1’) belong to the same class and are clustered together for as long as the classifier assigns label ‘1’ to them. If a frame in the sequence $f_{A_i}$ is labelled as ‘2’, denoting a high degree of visual dissimilarity, that frame $f_{A_i}^{2}$ (where 2 denotes the label assigned by the classifier) is taken as the training sample for the successive frames.

In Self Organising Maps (SOM) (note 4), input patterns are fully connected to all neurons via adaptable weights, and during the training process neighbouring input patterns are projected onto adjacent neurons of the lattice. SOM enjoys the merits of input space density approximation and independence of the order of the input patterns. Like the K-Means algorithm, SOM requires the size of the lattice to be predefined. In the basic SOM training algorithm, the prototype vectors are trained with equation (1):

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x - m_i(t)] \qquad (1)$$

Where m is the weight of the neurons in the SOM network hci (t) is the neighbourhood function that is defined in (2-13).  || rc − ri ||2  hci(t) = α (t)exp  2  2α (t) 

(2)

Where, α (t) is the monotonically decreasing learning is rate and r represents the position of the corresponding neuron. From the experimental results, it was noted that using a single layer SOM elimination of true negative images by the classifier was limited to those feature vectors, which are represented by the term x − mi (t) in the training function. Hence, we propose a Dual Layer SOM (DL-SOM) to improve the performance of the SOM. The algorithm workflow and DL-SOM network structure is presented in Figure 2. The evaluation function for the second layer is presented in equation (2-14). mi (t + 1) = mi (t) + hci (t)[x + mi (t)]

(3)

Figure 2 - Dual Layer SOM and PSO based Highlight detection
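As an illustration of the basic single-layer update in equations (1) and (2), the following minimal Python sketch performs one training step on a small lattice. The function name `som_step`, the 2×2 lattice, the toy weights and the value of the learning rate are assumptions made for illustration and are not taken from the Papyrus implementation. The winner is chosen by Euclidean distance, which is the usual SOM convention rather than something stated in the text, and the Gaussian neighbourhood uses alpha(t) both as amplitude and as width, which is one reading of equation (2).

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def som_step(weights, positions, x, alpha):
    """One SOM update: m_i(t+1) = m_i(t) + h_ci(t) * (x - m_i(t)).

    weights   : list of prototype vectors m_i
    positions : list of lattice coordinates r_i (same order as weights)
    x         : input pattern
    alpha     : value of the decreasing learning rate alpha(t) at this step
    """
    # Winner c: the neuron whose prototype is closest to x (assumed Euclidean).
    c = min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighbourhood value h_ci(t): largest at the winner, decaying with
        # lattice distance from it.
        h = alpha * math.exp(-sq_dist(positions[c], r_i) / (2.0 * alpha ** 2))
        updated.append([m + h * (xk - m) for m, xk in zip(m_i, x)])
    return updated, c

# Toy example: 2x2 lattice with 2-dimensional prototypes.
positions = [(0, 0), (1, 0), (0, 1), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [0.9, 0.8]
new_weights, winner = som_step(weights, positions, x, alpha=0.5)
```

In this toy run the winner moves halfway towards the input, while the neuron at the opposite corner of the lattice barely moves, which is the topology-preserving behaviour the text describes.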

The output of the classifier is a measure of visual dissimilarity. This output is further analysed by passing the values through a high-pass filter that detects the positive slopes encountered in the results; the corresponding frames are then selected as the key frames, or visual highlights, of the video. The workflow includes a misclassification tolerance of 3 frames, which was determined experimentally. These frames are also used in the further analysis of feature and event detection. Since scene filtering is achieved with the feature detection algorithm, the feature extraction algorithm [2] is presented next.

For the extraction of high-level visual features and events, a rectangular mesh structure is trained with both positive and negative samples from the pre-defined training models. The feature detector is a binary classifier, assigning labels to the input feature vectors. The network structure is presented in Figure 2, where X is the input feature vector. The training of the network neurons is performed using particle swarm optimization (PSO). The input feature vector from the training model is presented to the network, and the winner node is selected by competitive learning. The features from the selected winner node and the input training feature are presented to PSO. The d-dimensional optimization problem to be solved by PSO is the L1 metric between the winner node feature vector and the input feature vector. The particle swarm for each dimension of the input feature is initialized randomly. The evaluation function for each particle in each dimension is calculated, and the pbest and gbest values for the particle swarm are updated accordingly. The velocity and position of each particle in each dimension are then updated. The iteration continues until the result of the evaluation function is less than the threshold e_th.
The training of the algorithm is continued until all the input patterns from the training models are exhausted.

4. Classification model

One of the key challenges in developing an automatic classification model is the “Semantic Gap”, which is succinctly defined as the gap between low-level features and high-level semantic features. To address this problem, a large number of indexing and retrieval algorithms have been presented in the literature. Although the performance of machine learning techniques has improved considerably, their outcomes are still far from the results generated by human cognition. To enhance the performance of machine learning algorithms, recent developments in optimisation techniques have been inspired by the problem-solving abilities of biological organisms, such as bird flocking and fish schooling. One such technique, developed by Eberhart and Kennedy, is called “Particle Swarm Optimisation (PSO)”. In comparison to other evolutionary computation algorithms, the PSO algorithm rests on the following two main assertions [3]:




• Mind is social: by learning from experience and emulating the successful behaviours of others, people are able to adapt to complex environments through the discovery of relatively optimal patterns of attitudes, beliefs and behaviours.

• Particle swarms are a useful computational intelligence methodology: central to the concept of computational intelligence is system adaptation that enables or facilitates intelligent behaviour in complex and changing environments. Swarm intelligence comprises three steps, namely evaluate, compare and imitate. Each particle goes through these stages by performing simple mathematical operations in solving a more complex optimisation problem.
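The evaluate, compare and imitate steps can be sketched in a few lines of Python. The example below is a generic global-best PSO that minimises the L1 distance between a particle position and a target feature vector, in the spirit of the training objective described in Section 3. The function name `pso_minimise_l1`, the inertia and acceleration coefficients, the swarm size, the stopping threshold and the target vector are illustrative assumptions; they are not the parameters of the Papyrus classifier.

```python
import random

def l1(a, b):
    """L1 (city-block) distance between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

def pso_minimise_l1(target, n_particles=20, max_iter=200, e_th=1e-3,
                    w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO minimising l1(position, target).

    All numeric defaults are assumptions chosen for this sketch.
    Returns (best position, best cost, best cost of the initial swarm).
    """
    rng = random.Random(seed)
    dim = len(target)
    # Random initialisation of positions; velocities start at zero.
    pos = [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [l1(p, target) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    initial_cost = gbest_cost

    for _ in range(max_iter):
        if gbest_cost < e_th:          # stop once below the threshold
            break
        for i in range(n_particles):
            # Imitate: move towards the particle's own best and the swarm's best.
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Evaluate the new position.
            cost = l1(pos[i], target)
            # Compare with the personal and global bests and keep improvements.
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = pos[i][:], cost
    return gbest, gbest_cost, initial_cost

target = [0.2, 0.5, 0.9]   # hypothetical feature vector
best, cost, initial = pso_minimise_l1(target)
```

Because the global best is only ever replaced by a better position, the returned cost can never exceed the best cost of the initial random swarm.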

Following the advantages listed above for the use of the PSO algorithm, a Self Organising Map (SOM) based visual classifier has been developed and integrated in the Papyrus system for semantic indexing of the visual medium. Neural network architectures are modelled on components of the nervous system and can be categorised as feedforward, feedback and competitive [4]. Feedforward networks transform a set of input signals into a set of output signals. The desired input-output transformation is usually determined by external, supervised adjustment of the system parameters. In feedback networks [7], the input information defines the initial activity state of the feedback system, and after state transitions the asymptotic final state is identified as the outcome of the computation. In competitive learning networks, neighbouring cells in a neural network compete in their activities by means of mutual lateral interactions and develop adaptively into specific detectors of different signal patterns. In competitive neural networks, active neurons reinforce their neighbourhood within certain regions, while suppressing the activities of the other neurons [5]. This is called on-center/off-surround competition. The objective of SOM is to represent high-dimensional input patterns with prototype vectors that can be visualised in a (usually two-dimensional) lattice structure [6]. Each unit in the lattice is called a neuron, and adjacent neurons are connected to each other, which results in a clear topology of how the network fits itself to the input space. Input patterns are fully connected to all neurons via adaptable weights and, during the training process, neighbouring input patterns are projected into the lattice onto adjacent neurons. SOM enjoys the merit of input space density approximation and independence of the order of input patterns. A detailed discussion of the implementation of the classifier has been presented in [8].
In Figure 3, an overview of semantic concept co-existence is presented. A thorough evaluation of the video analysis components has been presented in [9].

5. The audio component analysis

The analysis of the audio component of video items achieves two main objectives: on the one hand, it complements the visual component analysis, enabling, through multimodal analysis, the generation of higher-level, semantically relevant metadata; on the other hand, it provides a semantic indexing of video items similar to the one provided for textual news items, in order to make them available in a uniform, coherent manner. Multimodal analysis exploits the combination of visual and audio features extracted from the digital media, and the interaction between different layers and data streams present in the same multimedia document, to provide semantic categories extracted by the combination of multiple modalities. Audio features, in particular, provide the basic content structure by identifying video segments characterized by narration, interviews, and noise or music. Segments are labelled by type (e.g. speech / non-speech), gender (male / female) and speaker. Speech segments are further analysed through a speech recognition process to provide the topic being discussed, the main concepts expressed and the related Named Entities.

Figure 3 – An example of classification models (with the concepts windfarm, sky, windmill and vegetation)

The following paragraphs describe how each task has been achieved, starting from the speaker segmentation process that is the basis for content structuring and speech segment identification, followed by a description of the speech recognition process that generates transcriptions which can finally be analysed by Natural Language Processing techniques. The last paragraph illustrates a novel method for analysing textual content based on Wikipedia as a linguistic resource, which has been tailored for speech transcriptions to reduce the impact of speech recognition errors on the metadata generation process.

Speaker segmentation

Speaker segmentation, also known as speaker diarisation, refers to the process of automatically annotating a given audio data source in terms of “who spoke when” [10]. It gives an insight into the structure of audio items by identifying segments with homogeneous audio features and by providing a descriptive label of their content (e.g. “speech”, “male”, “speaker A”, “noise” ...). A typical speaker diarisation system conceptually performs these tasks:

• Audio feature extraction: features extracted from the audio stream are intended to suggest information about the speakers in order to enable the system to separate them optimally.
• Speech activity detection: an audio stream may consist of acoustic activities such as speech, noise, music, background conversation and silence. Non-speech regions should be detected and removed from the audio stream.
• Speaker change detection: inside every speech region, a speaker change (or speaker turn) detector is used to find points in the audio stream which are candidates for speaker change points.
• Gender detection: for the segments classified as speech, it detects whether the speaker is a male, a female or a child.
• Speaker clustering: segmented regions belonging to the same speaker are grouped together, irrespective of whether such segments come from the same acoustic file or from different ones.

These tasks can be performed by different algorithms applied in different order, mixed together and repeated iteratively.
By structuring the audio/video stream into speaker turns, speaker diarisation has already proven its usefulness for the indexing of broadcast news, and of multimedia objects in general, making it possible, for example, to track people across recordings. Speaker diarisation is also useful as a preliminary step in the task of automatic transcription. The usual output of a speaker diarisation system is a list of time slices (usually represented by their start and end times, or by their start time and duration) with a description of each slice (usually represented as a set of tags). This information, either alone or integrated with information extracted from other modalities, may contribute extensively to the overall semantic interpretation of multimedia data [11].
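The output format just described, a list of time slices each carrying a set of tags, maps naturally onto a small data structure. The Python sketch below is only an illustration of that idea: the class name `Slice`, the tag vocabulary, the `speaking_time` helper and the example timings are all hypothetical and do not reproduce the actual output format of any diarisation tool.

```python
from dataclasses import dataclass, field

@dataclass
class Slice:
    """One diarisation time slice: start time, duration and descriptive tags."""
    start: float                 # seconds from the beginning of the item
    duration: float              # seconds
    tags: frozenset = field(default_factory=frozenset)

    @property
    def end(self):
        # The alternative "start and end time" representation.
        return self.start + self.duration

def speaking_time(slices):
    """Total duration per speaker label, counting only slices tagged 'speech'."""
    totals = {}
    for s in slices:
        if "speech" not in s.tags:
            continue
        for tag in s.tags:
            if tag.startswith("speaker "):
                totals[tag] = totals.get(tag, 0.0) + s.duration
    return totals

# Hypothetical output for a short news item.
slices = [
    Slice(0.0, 4.5, frozenset({"speech", "male", "speaker A"})),
    Slice(4.5, 2.0, frozenset({"noise"})),
    Slice(6.5, 3.0, frozenset({"speech", "female", "speaker B"})),
    Slice(9.5, 1.5, frozenset({"speech", "male", "speaker A"})),
]
totals = speaking_time(slices)
```

Keeping the tags as a set means that labels coming from other modalities could be added to the same slice without changing the structure.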

The Papyrus Speaker Diarisation Framework (PSDF) includes tools for audio format conversion, feature extraction, speaker segmentation, speaker clustering, speech activity detection and gender detection. The PSDF provides three algorithms that implement a complete speaker diarisation system, in order to provide the most reliable results for content structuring and multimodal analysis. Speech activity detection is a central task in speaker diarisation, and evaluation measures such as the Diarisation Error Rate (DER) are directly affected by the performance on this task. Speaker diarisation is often used for speaker tracking, and speech activity detection allows finer tracking by excluding audio regions where the speaker is not talking. Moreover, speech activity detection helps to avoid confusing homogeneous noise segments with a speaker. The PSDF includes an implementation of the most common techniques for speech activity detection. Among these, the ones most robust to noise and context have been selected and used for the speaker diarisation algorithms. Two gender classification tools are also available in the PSDF; they divide the segments into gender groupings in order to supply more side information about the speakers in the final output. Timing information of audio segments, speaker labels, speech/non-speech tags and male/female tags are all metadata provided by the PSDF, and they can be further combined with the analysis results of other modalities.

Automatic speech recognition

Automatic Speech Recognition (ASR) is the process of converting spoken words to text. ASR supports the conceptual querying of video content and the synchronization with any kind of complementary resource. The potential of ASR-based indexing has been demonstrated most successfully in the broadcast news domain [12].
In fact, despite several years of research in this field, ASR systems work reliably only under rather constrained conditions, where the restrictive assumptions described in Table 1 can be made. State-of-the-art performance levels for large-vocabulary tasks (i.e. broadcast news speech) are between 10% and 20% Word Error Rate (WER), depending on the language, type of speech and audio quality. For other domains, as in the Papyrus case, values under 50% are difficult to obtain [13]. The ASR task of the Papyrus project is a continuous, spontaneous, large-vocabulary speech recognition task involving different speakers, over different channels, in a noisy environment.


Factors | Best case | Worst case | Papyrus case
------- | --------- | ---------- | ------------
Vocabulary size | Small vocabulary | Large vocabulary | Large vocabulary
ASR type | Dictation | Continuous speech | Continuous speech
Speech type | Reading | Spontaneous | Both
Speaker accent | Perfect match with the acoustic model training set | Non-native speakers outside the acoustic model training set | Both
Channel characteristics | Microphone | Telephone | Microphone
Environment | No kind of noise | Noise, Music, Overlapping speech | Noise, Music, Overlapping speech

Table 1 - Main factors affecting ASR performance

The task therefore requires a system that is robust to different environmental conditions, that does not need training on individual speakers' voices and that is robust to different speaker accents. The mean WER for Papyrus videos is 48.1%, with high variability from video to video due to different audio and speech conditions, ranging from read speech in a studio environment to spontaneous speech under noisy acoustic conditions. This nevertheless guarantees sufficient accuracy for a robust textual analysis [14].
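For readers unfamiliar with the measure, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the ASR output and a reference transcript, divided by the number of reference words. The Python sketch below illustrates that standard definition; the function name `word_error_rate` and the example sentences are invented for illustration and are not Papyrus data or the scoring tool used in the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Invented example: one deletion and two substitutions over 8 reference words.
reference = "the wind farm is a paradise for tourists"
hypothesis = "the wind farm is paris for tourist"
wer = word_error_rate(reference, hypothesis)
```

Since insertions are counted as errors but do not add reference words, the ratio can exceed 1 for very poor transcriptions.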

Figure 4 – ASR Workflow

Figure 4 describes the whole process of audio analysis, with some details on the components affecting ASR. An ASR engine requires, for a given language:

1. Acoustic model: describes the basic sound units of the language (phonemes)
2. Vocabulary: describes the possible pronunciations of all the words of the language
3. Language model: describes how the words are related to each other in the language

Results from different software (Sphinx3 from Carnegie Mellon University and Sonic from the University of Colorado), using different parameters, were compared for the five videos (English version) for which the reference texts had been provided. The best results were obtained using the HUB4 acoustic model distributed with the latest version of Sphinx4 for Java⁵ and a language model that combines the lm_giga_64k language model with a language model specific to energy (Energy). Pronunciations for words that are in the Energy language model but not in the standard CMU dictionary (7a) have been added to the vocabulary. The energy-specific language model was generated starting from Deutsche Welle and Agence France Presse news on renewable energy and then combined with the lm_giga_64k language model.⁶ The language model resulting from the combination accounts for 67000 words, 3000 of which are specific to the energy domain. This combination of resources, specifically tailored to the energy domain, improved recognition accuracy by 10% with respect to the standard resources.

Concept Mapping

While keyword extraction has been widely investigated in the text domain, less effort has been devoted to speech transcripts [15]. Linguistic analysis of speech transcriptions is affected by a) word recognition errors, b) the lack of punctuation and c) the lack of linguistic structure that characterizes spontaneous speech. A method is therefore necessary to reduce errors and increase precision in the metadata generation process. The proposed method relies mainly on Wikipedia as a tool for validating the extracted linguistic constructs in terms of meaningfulness and relevance to the context.
The method is implemented in a specific tool, the Concept Mapper, which has been developed for analysing both textual news items and audio transcripts in order to identify the most relevant concepts and to connect them to the proper ontology identifiers. Textual news items and audio transcripts are treated differently in the concept selection stage. Links to the historiographical issues are achieved indirectly through the mapping between the History Ontology and the News Ontology. The role of the Concept Mapper is to map news content to the News Ontology by leveraging textual and spoken language technologies and complementary resources such as Wikipedia. Wikipedia provides both a linguistic resource and a source of additional metadata for semantic indexing. For each detected concept, in fact, the tool provides the following information:


• The keywords that were identified as representative of the concept and their frequency in the news item
• Additional information provided by the Wikipedia page that describes the concept:
   - Title
   - Translations to other languages
   - Anchor texts and Redirects
   - Categories
• A score of relevance for the news item (internal relatedness)
• A score of relevance for the domain (external relatedness)
• The ontology identifier of the concept

The Concept Mapper is implemented in four steps:

1) candidate keyword extraction (noun phrases are selected using a shallow parsing procedure [16])
2) “anchor search” in Wikipedia content (exploiting all the available alternative ways of referring to the same concept, i.e. the anchors) and candidate keyword disambiguation [17] (whenever a noun phrase refers to more than one concept, or Wikipedia page, the one that best relates to the news item is chosen, using the internal relatedness measure [18] as a semantic proximity index)
3) keyword ranking and selection (internal and external relatedness are used to choose which of these concepts are relevant enough to the story and to the Papyrus domain to be retained as semantic metadata)
4) ontology connection (an ontology identifier is associated with each concept, when available)


Figure 5 - Architecture of the concept mapper

The main issues in this process are, on the one hand, the detection of erroneous nominal phrases across sentences (due to the lack of punctuation) and, on the other hand, the loss of correct nominal phrases due to speech recognition errors and to the repetitions and stammering of spontaneous speech. While the loss of information is not easily recoverable and will affect the system recall, the detection of erroneous chunks can be reduced by filtering the results with a Wikipedia validation process. This improves the system precision, preventing most speech recognition mistakes from affecting the metadata generation. The validation process is essentially a two-step process. In the first step, only nominal phrases that are linkable to a Wikipedia page are kept, since this makes it possible to assign a meaning to the nominal phrase. Even if spurious nominal phrases are eliminated at this stage, it is still possible that irrelevant chunks are kept. The second step aims at identifying them by checking whether their meaning is pertinent to the context (both the internal context of the news item and the external context of the news domain). For this purpose the internal and external relatedness measures can be used, as well as the frequency of the nominal phrase. Spoken language is indeed more redundant than written language, and repetitions of terms within a shot prove their relevance even when the ASR texts include errors and lack structure. To identify the most appropriate criteria for nominal phrase selection, a manual annotation of the reference texts is necessary in order to define the “reference nominal phrases” that the system should be able to extract. The “reference nominal phrases” are those concepts that best reflect the news content and are agreed on by domain experts.
Once the list of “reference nominal phrases” is available, different selection criteria (based on internal relatedness, external relatedness, frequency, commonness, confidence and any appropriate combination of them) can be compared in order to maximise the precision and recall of the concepts retrieved from the transcripts. To illustrate the process of criteria selection, the results obtained by analysing the longest available video item are presented here. The analysis of the reference text of “332134 6 2007 made in germany schottland ausbau windenergie english” led to the (manual) identification of 39 relevant concepts among the 103 nominal phrases that were actually identified and had a corresponding page in Wikipedia. These can be considered representative of the news content. Among them are: renewable energy, Scotland, Scottish Power, rural, hill, sight, tourism, turbine, wilderness, wildlife, wind farm and wind power. The textual analysis of the ASR transcription led to the automatic identification of 123 nominal phrases (after the first Wikipedia validation step). Since only 20 of them correspond to the “reference nominal phrases”, the concept recall is 51.3% (20 correctly identified concepts over 39 reference concepts) and cannot be improved by further selection. On the other hand, the precision value of 16.3% (20 correctly identified concepts over 123 identified concepts) can be improved by identifying criteria that select most of the correct concepts out of the automatically retrieved ones. In order not to affect the system recall too seriously, the F score, instead of the precision, is maximised. Figure 6 shows the F score trend, for each selection criterion, as the selection threshold decreases and, consequently, the number of selected nominal phrases increases. The graph shows that the F score is maximised by all the criteria when around 20 nominal phrases are selected. Furthermore, the graph shows that the external relatedness of the chunk, multiplied by the number of its occurrences, achieves the highest F score at almost any level of the threshold (and number of selected chunks).
In particular, the maximum is reached at the threshold 0.45 of the “frequency * external relatedness” measure, selecting 21 chunks: as 15 of them match the “reference nominal phrases”, the recall is 38.5% and the precision 71.4%. The final criterion that has been implemented is therefore to select from ASR transcriptions only nominal phrases with external relatedness (multiplied by the frequency) above 0.45. The better performance of the external relatedness with respect to the internal one is explained by the presence of speech recognition errors that affect the internal context of the news item, making the external (domain) context more reliable. From a qualitative point of view, the retrieved nominal phrases cover all the main topics of the news item. With respect to precision, it should be noted that most of the “erroneously” identified concepts are actually concepts that were correctly identified and correctly disambiguated but do not satisfy the chosen relevance requirement (euro, Europe, meter, engineering, people, pipeline transport). Thus discrepancies between the manually assigned relevance and the system-generated relatedness account for most precision errors (relevance overestimation) and for a large fraction of recall errors (relevance underestimation). In this respect, it should be noted that concept relevance is highly subjective and that the observed disagreement on the degree of relevance falls within the assessed inter-rater disagreement (Cohen’s kappa coefficient of 0.45). Concerning geographical locations in particular, since their identification is one of the objectives of the multimodal analysis, the method described enables the retrieval of generic locations (e.g. hills, shore, sea, mountains, countryside, etc.) as well as nations and main regions and towns, but mostly fails on small towns, as ASR vocabularies do not provide a pronunciation for them (nor does Wikipedia provide a page for them). The selected concepts, together with the additional Wikipedia-related information, provide semantic metadata for the news item that enable a semantic search. Furthermore, the Concept Mapper identifies, for each selected concept, the proper News Ontology link by exploiting the ontology structure and maximising the group relatedness, i.e. the semantic proximity between the concept and the group of concepts (synonym set) as defined in the ontology [19].
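The precision, recall and F score figures quoted in this section follow from the usual definitions and can be checked with a few lines of Python. In the sketch below the counts are those reported in the text and the balanced F1 form of the F score is assumed. The helper names `prf` and `select_phrases`, as well as the per-phrase frequencies and relatedness values, are hypothetical illustrations of the “frequency * external relatedness” criterion and are not the Concept Mapper's actual code or data.

```python
def prf(correct, selected, reference):
    """Precision, recall and balanced F score from raw counts."""
    precision = correct / selected
    recall = correct / reference
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def select_phrases(phrases, threshold=0.45):
    """Keep phrases whose frequency * external relatedness exceeds the threshold."""
    return [name for name, freq, ext_rel in phrases if freq * ext_rel > threshold]

# Counts reported in the text: all 123 validated phrases versus the 21 selected.
p_all, r_all, f_all = prf(correct=20, selected=123, reference=39)
p_sel, r_sel, f_sel = prf(correct=15, selected=21, reference=39)

# Hypothetical (phrase, frequency, external relatedness) triples.
candidates = [("wind farm", 4, 0.30), ("turbine", 3, 0.25), ("people", 2, 0.05)]
kept = select_phrases(candidates)
```

With these counts the F score rises from about 0.25 without selection to 0.50 at the chosen threshold, which is the trade-off the text describes: a loss of recall (51.3% to 38.5%) in exchange for a much larger gain in precision (16.3% to 71.4%).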

Figure 6 – Comparison of selection criteria

A Wikipedia annotation is used as an intermediate step towards an ontology annotation because the Papyrus News Ontology is a domain-restricted ontology (it does not include “tourism”, for example, although this can be quite meaningful metadata) and because it does not provide confidence measures for filtering the textual analysis results. In the “CLS 332134 6 2007 made in germany schottland ausbau windenergie english” video, for example, the spoken word “paradise” is erroneously recognised as “Paris”, which would be validated by the ontology, whereas filtering with the internal relatedness eliminates it (the video is actually set in Scotland). The proposed method therefore automatically identifies relevant concepts in textual documents and automatically maps them to their formalization in a given domain ontology. This enables automatic annotation of texts and semantic metadata generation exploiting both Wikipedia knowledge and the Ontology knowledge. The method has already been implemented in the Papyrus (Cultural and Historical Digital Libraries Dynamically Mined from News Archives) prototype to provide metadata generation for the semantic search functionality and content mapping to the News Ontology for the cross-discipline search functionality. It analyses both textual content and speech transcripts in English and French, in two domains (renewable energy and biotechnology), and can easily be extended to other languages and domains. The temporal information provided along with the metadata enables semantic indexing as well as the synchronization of video and audio segmentations. This makes it possible to improve metadata quality as well as scene segmentation by simultaneously taking into account the information provided through different modalities, and is part of the ongoing research activity.

6. Papyrus Interdisciplinary Search Engine

The audiovisual framework presented in this paper has been integrated into an online Papyrus interdisciplinary search engine.

Figure 7 – An overview of the Papyrus search engine interface


7. Conclusion and future work

In conclusion, the integrated framework presented in this paper provides easy and flexible access to previously unexplored audiovisual items for research on issues of historiographical importance. The integrated framework is part of the Papyrus interdisciplinary search engine, which can be accessed through the online portal.⁷ In future we will focus on evaluating the performance of the integrated system, along with other usability issues.

Figure 8 – The screenshot of the Papyrus results page for the query “windmill”


Figure 9 – A screenshot of the audiovisual metadata


Notes

1. For a review of such suggestions, see Aristotle Tympas, “Methods in the History of Technology”, in Colin Hempstead (ed.), Encyclopedia of 20th Century Technology, New York: Routledge, (2005), pp. 485-489. For suggestions concerning the history of technology in Europe, see the articles in a special issue of History and Technology 21, no. 1 (2005).

2. For a sample of books on the history of wind power, see T. Lindsay Baker, A Field Guide to American Windmills, Norman: University of Oklahoma Press, 1985; Richard Hills, Power from Wind: A History of Windmill Technology, New York: Cambridge University Press, 1994; Matthias Heymann, Die Geschichte der Windenergienutzung 1890-1990, Frankfurt: Campus-Verlag, 1995; Robert Righter, Wind Energy in America: A History, Norman: University of Oklahoma Press, 1996. For insightful historiographical suggestions, see also Matthias Heymann, “Signs of Hubris: The Shaping of Wind Technology Styles in Germany, Denmark, and the United States, 1940-1990”, Technology and Culture 39, no. 4 (1998): 641-670 and Geert Verbong, “Wind Power in the Netherlands 1970-1995”, Centaurus 41, no. 1-2, (1999): 137-160.

3. See the references in [Tzokas et al., this session], footnotes 7-10.

4. A brief discussion on the theoretical motivation for the use of SOM is presented in the next section.

5. Other acoustic models (old version of HUB4, Voxforge and WSJ) showed degraded performance.

6. http://www.inference.phy.cam.ac.uk/kv227/lm_giga/

7. http://iris.atc.gr/CMS_Papyrus_1_1/

References

[1] Manjunath, B. S., P. Salembier and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, New York: Wiley, 2003.

[2] Chandramouli, K. and E. Izquierdo, “Visual Highlight Detection using Particle Swarm Optimisation”, Latin-American Conference on Networked and Electronic Media, 2009.

[3] Kennedy, J. and R. C. Eberhart, Swarm Intelligence, San Francisco, CA: Morgan Kaufmann, 2001.

[4] Rumelhart, D. E., G. E. Hinton and R. J. Williams, “Learning internal representations by error propagation”, in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition (vol. 1, pp. 318-362), Cambridge, MA: MIT Press, 1986.

[5] Hopfield, J. J., “Neural networks and physical systems with emergent collective computational abilities”, Proceedings of the National Academy of Sciences 79 (1982): 2554-2558.

[6] Inoue, M., “Image Retrieval: Research and use in the information explosion”, Progress in Informatics 6 (2009): 3-14.

[7] Kohonen, T., “The Self Organising Map”, Proceedings of IEEE 78, no. 4 (1990): 1464-1480.

[8] Chandramouli, K. and E. Izquierdo, “Image Retrieval using Particle Swarm Optimisation”, in M. C. Angelides, P. Mylonas and M. Wallace (eds.), Advances in Semantic Media Adaptation and Personalization (pp. 297-319), CRC Press, 2009.

[9] Chandramouli, K., et al., “Techniques for Multimodal content analysis”, Technical Report, 2009.

[10] Tranter, Sue E. and Douglas A. Reynolds, “An overview of automatic speaker diarization systems”, IEEE Transactions on Audio, Speech, and Language Processing 14, no. 5 (2006): 1557-1565.

[11] Friedland, G., H. Hung and C. Yeo, “Multi-modal Speaker Diarization of Real-World Meetings Using Compressed-Domain Video Features”, Tech. Rep. 08007, ICSI, October 2008.

[12] Huijbregts, M.A.H., R.J.F. Ordelman and F.M.G. de Jong, “Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition”, in Proceedings of the Second International Conference on Semantic and Digital Media Technologies, SAMT 2007, 5-7 Dec 2007, Genoa, Italy, 2007.

[13] Rehatschek, H., R. Sorschag, B. Rettenbacher, H. Zeiner, J. Nioche, F.M.G. de Jong, R.J.F. Ordelman and D. van Leeuwen, “Mediacampaign: A Multimodal Semantic Analysis System for Advertisement Campaign Detection”, in Proceedings of the International Workshop on Content-Based Multimedia Indexing, CBMI 2008, 18-20 June 2008, pp. 85-92, London, UK.

[14] Garofolo, J.S., C.G.P. Auzanne and E.M. Voorhees, “The TREC SDR Track: A Success Story”, in Eighth Text Retrieval Conference, pp. 107-129, Washington, 2000.

[15] Liu, F., D. Pennell, F. Liu and Y. Liu, “Unsupervised approaches for automatic keyword extraction using meeting transcripts”, in NAACL ’09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Morristown, NJ, USA, 2009, pp. 620–628, Association for Computational Linguistics.

[16] Schmid, Helmut, “Probabilistic part-of-speech tagging using decision trees”, in Proceedings of the International Conference on New Methods in Language Processing, pp. 44–49, 1994. (available at http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf)

[17] Milne, D. and I. Witten, “Learning to link with Wikipedia”, in CIKM ’08: Proceeding of the 17th ACM conference on Information and knowledge management, New York, NY, USA, 2008, pp. 509–518, ACM.

[18] Milne, D. and I. Witten, “An effective, low-cost measure of semantic relatedness obtained from Wikipedia links”, 2009.

[19] Reiter, Nils, Matthias Hartung and Anette Frank, “A Resource-Poor Approach for Linking Ontology Classes to Wikipedia Articles”, in Johan Bos and Rodolfo Delmonte (eds.), Semantics in Text Processing. STEP 2008 Conference Proceedings, vol. 1 of Research in Computational Semantics, pp. 381–387, College Publications, 2008. (available at http://www.aclweb.org/anthology/W08-2231)