Stefan Trausan-Matu (Ed.)

First International K-Teams Workshop on Semantic and Collaborative Technologies for the Web

Organized and supported by the ERRIC FP7-REGPOT-2010-1/264207 EU Research Project

Politehnica University of Bucharest, Bucharest, ROMANIA
21-22 June 2011


Preface

For decades, a major concern in Artificial Intelligence (AI) was knowledge representation and processing. The focus has since shifted from the individual towards a social, knowledge-building vision, empowering a socio-cultural perspective. Some of these ideas were introduced by Marvin Minsky in his model of the society of mind; others were developed in multi-agent systems. Machine learning, based on inductive methods, may also be seen as a way of extracting knowledge from "social" data. The appearance of the Internet and of the World Wide Web made the change more dramatic: the possibility of rapid communication and the development of huge social networks changed the perspective entirely. Even if Tim Berners-Lee said that the second generation of the web would be the Semantic Web, based on classical AI knowledge representation and processing, the reality is that the actual Web 2.0 is the Social Web, based on supporting communities, social networking and collaboration. Whereas the Semantic Web has as a major point the possibility of representing and processing the semantics of texts (web pages) in a knowledge-based tradition (e.g. using ontologies), the Social Web is based on supporting humans to communicate. Both are essentially based on language, but from two different perspectives: the first starts from the presupposition of automatic language processing (or, at least, of annotating web pages with metadata); the second encourages people to use language in virtual communities and, in work settings, in virtual teams building knowledge. Our vision, which also underlies the idea of the K-Teams Workshop, is that the two paradigms should be integrated to support knowledge building in virtual teams.

The primary goal of the K-Teams workshop was to bring together researchers in the fields of the Semantic and Social Web in order to share knowledge (approaches, models, issues, solutions, applications). Debates are expected on directions and possibilities of future research in the domain, and we also expect to create a forum for further collaboration in this field of study. The workshop was organized and supported by the ERRIC (Empowering Romanian Research on Intelligent Information Technologies) FP7 EC-funded Research Project. It took place on 21-22 June 2011 at the Computer Science and Engineering Department of the Politehnica University of Bucharest, Romania. Participation was open to professors, researchers and students interested in the topics of the workshop.

June 2011 Stefan Trausan-Matu


Organizing Committee
Stefan Trausan-Matu, chairman
Adina Magda Florea
Traian Rebedea
Mihai Dascalu
Vlad Posea

Program Committee
Paul Chirita (Adobe Romania)
Valentin Cristea (Politehnica University of Bucharest, Romania)
Mihai Dascalu (Politehnica University of Bucharest, Romania)
Philippe Dessus (Universite Pierre-Mendes-France, France)
Ciprian Dobre (Politehnica University of Bucharest, Romania)
Adina Magda Florea (Politehnica University of Bucharest, Romania)
Wolfgang Greller (Open University of the Netherlands, The Netherlands)
Mounira Harzallah (University of Nantes, France)
John-Jules Meyer (University of Utrecht, The Netherlands)
Florin Pop (Politehnica University of Bucharest, Romania)
Costin Pribeanu (Institutul de Cercetare in Informatica, Romania)
Florin Radulescu (Politehnica University of Bucharest, Romania)
Traian Rebedea (Politehnica University of Bucharest, Romania)
Stefan Trausan-Matu (Politehnica University of Bucharest, Romania)
Julien Velcin (University Lyon 2, France)
Fridolin Wild (The Open University, UK)

Additional reviewers
Costin Chiru (Politehnica University of Bucharest, Romania)
Vlad Posea (Politehnica University of Bucharest, Romania)

DTP by Mihai Dascalu (Politehnica University of Bucharest, Romania)


Table of Contents

John Lee
Content, Conflict, Control: Semantics and Subversion .......................... 5

Khurshid Ahmad
Affect, Ontology and Sentiment Analysis .......................... 12

Traian Rebedea, Mihai Dascalu, Stefan Trausan-Matu and Costin-Gabriel Chiru
Automatic Feedback and Support for Students and Tutors Using CSCL Chat Conversations .......................... 20

Saida Kichou, Youssef Amghar and Hakima Mellah
Tags Filtering Based on User Profile .......................... 33

Matei Popovici, Mihnea Muraru, Alexandru Agache, Cristian Giumale and Lorina Negreanu
Modeling with Fluid Qualities: Formalisation and Computational Complexity .......................... 45

Cristian Turcitu and Florin Radulescu
A Multidimensional Perspective on Social Networks .......................... 60

Nicolae Nistor
Quantitative Models of Communities of Practice. A Call for Research .......................... 70

Iulia Pasov, Stefan Trausan-Matu and Traian Rebedea
A Rule-based System for Evaluating Students' Participation to a Forum .......................... 79

Stefan Daniel Dumitrescu, Stefan Trausan-Matu, Mihaela Brut and Florence Sedes
A Knowledge-based Approach to Entity Identification and Classification in Natural Language Text .......................... 91

Laura Dragoi and Florin Pop
Information Retrieval System for Similarity Semantic Search .......................... 107


Content, Conflict, Control: Semantics and Subversion

John Lee

School of Arts, Culture and Environment, and School of Informatics,
University of Edinburgh
[email protected]

Abstract. Collaboration requires that information can be shared, transformed and reconciled across differing representations. Communication needs structure and control, but it also needs to stimulate creativity and diversity. This continual tension was pointed out by Bakhtin. Formal representation systems face the same issue, being continually subverted as differing ontologies are brought into contact, and often into conflict. Examples relating to construction, learning and graphical communication illustrate the generality of these issues. Potential ways are emerging to address, and even to derive benefits from, the clash of ontologies.

Keywords: communication, collaboration, ontologies, learning, structure mapping.

1   Communication and subversion

Increasingly we face the need to collaborate around resources. This seems to be an inevitable focus in many different contexts. In learning, students need to be offered, and to share, discuss and transform, a wide range of resources of many different types. In the workplace — for instance, the construction industry — information is collected in many places, in many forms, and people need to access, discuss, abstract and augment this information in a huge variety of ways. Often the “semantic web” raises issues concerned with seeking unknown resources; but often, also, it can be about finding and supporting diverse ways to work with resources about which much is already known. The focus in these sorts of cases falls on communication, and on how it can be structured. There are competing objectives, in many instances, even where there may not appear to be at first sight. In particular, one objective may be to facilitate the effective use of the resources, to automate various ancillary functionalities, and hence to control the structure and content of communication; whereas another objective may be to enhance creativity, foster emergent viewpoints, and encourage diversity of interpretation. The latter kinds of objectives are especially likely in areas such as design and learning, of course; but wherever they arise, they are in tension with the former kinds. We see this kind of tension, for example, on the building site. In the construction industry, as discussed elsewhere [7], there is a natural inclination to suppose that overarching standardisation and automation will be helpful in improving the
efficiency and accuracy of communication, and in enforcing compliance with various standards, codes, regulations and legislation. However, these goals are subverted by the fact that in detail all building sites are different, and by the relentless tendency of groups to evolve their own conventions for interpretation, usage and practice. Moreover this subversion is commonly in fact desirable because it helps to accommodate local difficulties that are unforeseen by, and perhaps incommensurable with, the formal schemata of the documentation. Subversion is ultimately necessary to allow creative responses, innovation and development at both local levels and more generally. It is crucial for designers, whose stock in trade is to challenge and extend accepted notions of the limitations and even definitions of common concepts [12]. The importance of subversion is often overlooked in contexts such as construction. There is a tendency to assume that things can be defined, and once defined fixed, and that order and regularity can be imposed. The illusory nature of such assumptions is of course well recognised in many other contexts and traditions. Bakhtin is famous for having celebrated the notion of "Carnival" as an emblem of that which cannot be regulated in interaction. This might be envisaged as an occasional outbreak of mayhem, but he sought to distinguish it from transient spectacle. Rather he alludes to an ongoing process of subversion in which “a juxtaposition of marginalised and official discourses” [15] is central — a process that from his point of view is part and parcel of the notion of all human communication as "heteroglossia": a polyphony of voices vying to establish temporary dominance, or perhaps even consensus, but destined inevitably to continued dialectical struggle. And this struggle, as we have observed, guarantees that the door is always slightly ajar to innovation and creativity. The price of too much order is sterility: examples are legion. If we reflect, we know that periodic revolution establishing fresh standards and then a renewed period of sterility is also ultimately less fruitful, albeit perhaps temporarily more peaceful and less confusing, than a broad consensus that is nonetheless incessantly transformed by continual carnival underneath. The point of this in the current context is that the representation of the structure of a domain, of communicational content — of meaning — is not something that can be fixed. It needs to be susceptible of continual negotiation. Where we define formal systems that support these representations, we need also to define systems and processes that can support their renegotiation. 2

Representation and mismatch

It’s therefore possible to suggest [7] that an important role for the formal characterisation of communication is in identifying areas where there is mismatch in understandings, tension and conflict, and in pinning down the nature of the disagreement. Similarly to systems that try to enforce standardisation, the formal framework may be based on a defined ontology for the domain; but now it is seen as a temporary and local snapshot of a conceptualisation that needs to be reconciled with others, offering terms in which concepts can be seen to differ while their relationships can also be captured. It affords, moreover, a hook on which to hang an account of reasoning and argumentation that arises around resolving the issue.


An idea rather like this has been studied in the context of the semantic web by McNeill et al. [8]. Their idea is that agents who come into dialogue about some specific issue — say, a flight booking — will commonly find that their definitions of some key terms in the interaction will differ slightly. In other words, they will have slightly different ontologies underlying their representations of the domain in question. A formal characterisation of the ontology is provided, in which the characterisation of types of component entities and relations (the signature) is distinguished from the basic information about instances of these in a given situation, and rules describing possible actions (the theory). Situations are considered in which interacting agents are developing a plan, e.g. of the actions required to book tickets, and at some point a failure is encountered. The reason for the failure can be interactively identified by the agents as some specific mismatch between their ontologies, in either signature or theory. A procedure can then be defined that will refine one of them in order to remove the problem. It’s noted that this procedure in general is restricted to dealing with situations where the ontologies are on the whole quite similar and the discrepancies therefore minor. In many more complex (and perhaps realistic) cases it will be difficult to provide algorithmic approaches to resolution. Bundy and McNeill [1] propose that developing more flexible ways to resolve such issues by changing an ontology’s underlying syntax and semantics is one of the grand challenges facing research in artificial intelligence: “Understanding and implementing this ability must be a major focus of AI for the next 50 years.” (87) But in the meantime, even if the resolution can’t be automated and has to be pursued by human agents, the identification of the point of conflict, and assisting the recording of the rationale for whatever resolution is achieved along with the nature of that resolution, may be highly valuable. The notion of capturing “design rationale” for various purposes is well established in the design research literature, and sometimes practice. Commonly this has been related to argumentation systems, which also often produce an emphasis on dialectical processes intended to expose and resolve disagreements [3, 2]. Usually the focus is on capturing the reasons for particular design decisions. It’s therefore a very general notion that can emerge differently in a variety of contexts. Here we simply note that grounding rationale, where appropriate, in a framework of ontological representation and change may be an effective way to develop its use. In practice, capturing rationale has often been found difficult, to some extent because of problems with creating a clear representation of the context, the starting point of the decision issue, and then relating the rationale clearly to this. If we anchor the rationale within the kind of process that McNeill et al. [8] suggest, then it may have a much clearer and more effective starting point. Most of the discussion around these ontology-matching processes naturally enough arises in contexts where ontologies already exist in a more or less explicit form. In other situations, however, there may be none, or there may be only a rudimentary ontology, or one in the process of forming, and these processes may have a role in developing the ontology more fully. Such situations arise especially where ontologies are “crowd-sourced” or based on “folksonomies”. 
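As a toy illustration of the signature/theory distinction and of identifying a mismatch between two agents' ontologies, the sketch below represents each ontology as a signature (names and argument types) plus a theory (ground facts) and simply reports where the two disagree. This is only a sketch under invented names and types, not McNeill et al.'s actual formalism or refinement procedure; the prose resumes with crowd-sourced metadata below.

```python
# Illustrative only: an "ontology" as signature + theory, compared naively.
from dataclasses import dataclass, field

@dataclass
class Ontology:
    signature: dict = field(default_factory=dict)  # name -> tuple of argument types
    theory: set = field(default_factory=set)       # ground facts as tuples

def find_mismatches(a: Ontology, b: Ontology):
    """Return the signature entries and the facts on which two ontologies differ."""
    sig = {name: (a.signature.get(name), b.signature.get(name))
           for name in set(a.signature) | set(b.signature)
           if a.signature.get(name) != b.signature.get(name)}
    theory = a.theory ^ b.theory                   # facts held by only one agent
    return sig, theory

# Two agents planning a flight booking define "book" slightly differently.
agent_a = Ontology({"book": ("person", "flight")},
                   {("flight", "EDI-CDG"), ("person", "alice")})
agent_b = Ontology({"book": ("person", "flight", "payment")},
                   {("flight", "EDI-CDG"), ("payment", "card")})

sig_diff, theory_diff = find_mismatches(agent_a, agent_b)
print(sig_diff)     # arity of "book" differs: a signature mismatch
print(theory_diff)  # facts known to only one of the agents
```

Once such a point of conflict is isolated, the refinement (or the human negotiation) can be targeted at exactly that entry rather than at the ontologies as wholes.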
Here there is often some kind of resource for which metadata is required and has to be built up or added to gradually through an activity, perhaps of an online community. For instance, Microsoft offers "Windows Azure", intended to assist Open Government by supporting access to data with a system called "DataMarket":

   as [citizens or developers] combine different concepts, they are actually recommending new relationships within the data ecosystem. Every interaction actually helps crowd-source changes to the ontology, which in turn makes every future interaction more powerful and relevant. [10]

This can work, up to a point, but only in so far as the people combining the concepts somehow recognise and agree on what the concepts are. Sooner or later, there will be discrepancies, and then some matching process will be needed that can actually reconcile the existing ontology with proposed changes. On the face of it, systems like this are likely to support a monotonic process of accretion around an ontology that in itself is retained; and therefore additions that are in some sense inconsistent with the existing material are likely to be rejected. In general terms, something more flexible is needed, that can potentially accommodate in both directions.

3   An example: learning

Another example of a case like this is where we have data, such as video, that has rich content but little explicit representation of it. We are starting to develop automated means of analysing these kinds of data, but there is a long way to go. In the interim, it makes sense to involve communities of users in creating and improving metadata. A pilot project in Edinburgh has been developing a system for learners, in which learning materials are offered in the form of rich media that can be manipulated and annotated. Known as “YouTute”, the approach is based on videos of tutorial discussions that occur in ordinary tutorial groups, where typically 5-10 students will discuss a series of problems with a tutor. We collect naturally occurring tutorial dialogues as unedited video. Three streams are collected per tutorial (two from cameras, one from a “Smartboard” that captures anything written on it during the activity). These are later played in synchrony via a web-based interface that allows students to review the material and “edit” it by identifying segments that are of interest. These segments (which we call “tutes”) can be named, tagged, annotated and shared with other students. Students are able to see texts of relevant lecture slides, and the questions being discussed in the tutorial. The system has been deployed on several courses, is well received by students and seems to have worked especially well as a revision aid [11, 6]. A screenshot of the system in use appears in Figure 1.


Fig.1. YouTute interface
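For concreteness, a hypothetical sketch of the kind of record a "tute" might correspond to is given below; the field names and types are assumptions made for illustration, not YouTute's actual data model.

```python
# Hypothetical data model for a "tute": a named, tagged segment of the synced streams.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tute:
    tutorial_id: str                                       # which recorded tutorial
    start_s: float                                         # segment start (seconds)
    end_s: float                                           # segment end (seconds)
    title: str = ""
    tags: List[str] = field(default_factory=list)          # unstructured free tags
    annotations: List[str] = field(default_factory=list)   # free-text notes
    shared_with: List[str] = field(default_factory=list)   # other students

clip = Tute("inf1-tutorial-07", 312.0, 455.5,
            title="Base case of the induction proof",
            tags=["induction", "revision"])
```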

The process of editing the videos and selecting “good” dialogues is a shared activity, the responsibility of the students. It is a form of “social networking”, through which a community of students can emerge as learners who collaborate to create a new learning resource. This shared activity is also itself a learning activity, promoting reflection on the topics discussed, and re-evaluation of the original tutorial discussion. In the present context, the significance of this approach is that the learners will in the long run have to develop their own ideas about what constitutes interesting learning material, how it is organised, what are its constituent parts, how its various categories relate. Which is to say that they will have to evolve an ontology of learning materials, related probably also to the domain about which they are learning (in our cases so far, theoretical aspects of computer science). At present, all of this is implicit: the learners use only unstructured tags and free text, but in further development of the system a clearer ontology would certainly be required and desired, to facilitate organisation, search and use of the information. This ontology needs to be negotiated, and continuingly negotiable, as new learners enter the community, new topics are addressed, new ways of organising the material devised. However, there is a further wrinkle: we intend also that learners would be able to link other materials into the resource being constructed, e.g. other lectures found on the internet, pieces of video from broadcasters’ archive sites, documents and notes from a variety of sources. Whatever these things are, their content will need to be related to the existing material, but they will come with their own metadata, based on a variety of other ontologies. Once again we face the ontology matching problem. Again the message is that no amount of standardisation, e.g. exploitation of Dublin core or other schemata, will overcome this problem. But in fact it should not be seen as a problem — rather, it is an opportunity, to address the challenge and devise good approaches that will offer real flexibility combined with powerful support.


4   An example: graphical communication

In some respects, this discussion about ontologies can be seen as similar to certain aspects of reasoning by analogy, or using modalities such as graphics in reasoning. In analogical reasoning, one has to develop a mapping between structures, e.g. the “structure-mapping” approach of Gentner [5, 4]. There is a domain to be reasoned about, for instance the structure of atoms, and one finds a domain with relatable structure, e.g. the solar system, and uses the latter to reason about the former. Where there is analogy, of course, there is also disanalogy: one has to select specific elements of the two domains that will be related, disregarding others completely, and generally allowing that there will be some looseness of fit even between those that are used. Similarly in reasoning with graphics: Wang, with colleagues including the present author [13, 14], discussed some time ago a system at least superficially related at a formal level to the technique adopted by McNeill et al. [9], in which domains are represented algebraically as a signature of types of element, accompanied by a more specific theory. One domain, of course, is the graphical domain, in which objects such as perhaps lines and circles can be represented as being in various geometrical relationships. The other is whatever domain is to be depicted: once this is represented in a similar manner, a “signature morphism” (i.e. a structure mapping) can be defined that shows how various elements are depicted and e.g. what operations on the depiction are meaningful. Thus, for instance, one can give a clear semantic account of how Euler circle diagrams can be used to reason about the relationships between sets in general, or specific sets of particular things; or of how a certain sort of diagram can be used to depict both the atom and the solar system. It would certainly be desirable for collaborative work to be supported by these kinds of representations. Collaborators will naturally want to introduce new entities into the activity, and manipulate the ones that are already there. In this context, accommodating something as yet undepicted into a given representation (system) is like the process of accommodating it to a given ontology. In one sense, the ontologies of the graphical system and the depicted domain are very different, but under the morphism they are very close, if not identical. However, in the practical use of this kind of representation system, as in any other, there will always be change: changing contexts, purposes, user communities, etc. We want the mapping to subserve a purpose — the support of some argument or line of reasoning — and it may well happen, perhaps it will inevitably happen, that to subserve we must subvert, in the sense of changing the mapping to adjust its fit to differing and developing understandings of the domain and/or the reasoning process. 5

Conclusion

The point of these reflections has been to emphasise that the issues of flexibility and accommodation in formal systems are extremely general. Wherever communication happens, we need to support appropriate models of use. If Bakhtin, among many others, has alerted us to the ubiquitous nature of subversion, the ceaselessly carnivalesque nature of communication, we have also seen how this emerges in a variety of apparently quite formalised frameworks. It is not yet at all clear how in general to approach resolutions, but we have noted that there are moves and pointers in encouraging directions. In general we concur with Bundy and McNeill [1] in identifying this as a key issue for further research.

References

1. Bundy, Alan and McNeill, Fiona (2006). Representation as a Fluent: An AI Challenge for the Next Half Century. IEEE Intelligent Systems, May/June 2006.
2. Conklin, J. and Begeman, M. (1989). gIBIS: A Tool for All Reasons. Journal of the American Society for Information Science, 40, pp 200-213.
3. Buckingham Shum, S. (1996). Design Argumentation as Design Rationale. In The Encyclopaedia of Computer Science and Technology, Marcel Dekker Inc: NY, Vol. 35, Supp. 20, pp 95-128.
4. Falkenhainer, B., Forbus, K. and Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1): 1-63.
5. Gentner, D. (1983). Structure-mapping: A Theoretical Framework for Analogy. Cognitive Science, 7(2).
6. Lee, J. (2010). Vicarious Learning from Tutorial Dialogue. In M. Wolpers et al. (Eds.): EC-TEL 2010, LNCS 6383, Springer Verlag, pp 524-529.
7. Lee, John and McMeel, Dermott (2007). 'Pre-ontology' considerations for communication in construction. In J. Teller, J. Lee and C. Roussey (Eds.), Ontologies for Urban Development (Computational Intelligence series, Vol. 61), Springer Verlag, pp 169-179.
8. McNeill, Fiona, Bundy, Alan and Schorlemmer, Marco (2003). Dynamic ontology refinement. Proceedings of the ICAPS'03 Workshop on Plan Execution, Trento, Italy, June 2003.
9. McNeill, Fiona and Bundy, Alan (2010). Facilitating virtual interaction through flexible representation. In Encyclopedia of E-Business Development and Management in the Global Economy, IGI Global.
10. Microsoft (2011). DataMarket for Government. http://www.microsoft.com/windowsazure/government/default.aspx (accessed 01/06/2011).
11. Rabold, S., Anderson, S., Lee, J. and Mayo, N. (2008). YouTute: Online Social Networking for Vicarious Learning. Proceedings of ICL2008, September 24-26, 2008, Villach, Austria.
12. Ramscar, M., Lee, J. and Pain, H. (1996). A cognitively based approach to computer integration for design systems. Design Studies, 17, pp 465-483.
13. Wang, D. and Lee, J. (1993). Visual reasoning: its formal semantics and applications. Journal of Visual Languages and Computing, 4, pp 327-356.
14. Wang, D., Lee, J. and Zeevat, H. (1995). Reasoning with diagrammatic representations. In J. Glasgow, N. H. Narayanan & B. Chandrasekaran (Eds.), Diagrammatic Reasoning: Cognitive and Computational Perspectives, pp 339-393. Cambridge, Mass.: MIT Press.
15. Zappen, James P. (2000). Mikhail Bakhtin (1895-1975). In M. G. Moran and M. Ballif (Eds.), Twentieth-Century Rhetoric and Rhetoricians: Critical Studies and Sources. Westport: Greenwood Press, pp 7-20.


Affect, Ontology and Sentiment Analysis

Khurshid Ahmad

Department of Computer Science, Trinity College, Dublin 2, IRELAND
[email protected]

Abstract. Affective computing involves understanding and representing how subjective experience is articulated, and how this articulation impacts on others, especially in the critical areas of finance, commerce, well-being and security. There is considerable pride in describing oneself as rational, but that leaves little room for exuberant, contrarian, or irrational behaviour and beings. I will look very briefly at the limits of rationality and at some of the developments in computer-based affect analysis, and finish this paper by letting the reader know about my modest contribution to affective computing.

Keywords: affective computing, sentiment analysis, behavioural finance, information extraction.

1   Limits of Rationality?

There is evidence of bounded rationality [41] in the way reactions to news about financial markets change the (numerical) prices and volumes of the assets (e.g. shares, bonds, commodities) traded in those markets. Traders, for instance, keep on buying ever larger volumes of assets even when there is incontrovertible evidence that prices are falling; contrariwise, traders keep on selling ever larger volumes of assets when the market is rising. It has been suggested that there are limits to rational behaviour at certain times, especially during the onset of a crisis. The rationalist approach is to discount the news altogether and focus on prices and volumes; the problem here is that human beings do not always make rational choices ([42][40]), and the stakeholders in a financial market, especially at the onset of a boom or bust, behave in an exuberant manner [39]. Humans have a propensity to choose radically different solutions to the same problem if the problem is expressed or framed differently ([30][31]). The images collateral to economic and financial news, for example strikes and civil unrest, or the facial gestures of regulators, traders and investors, play a key role in 'framing' news and blogs. Expert traders and regulators usually make judicious choices in aggregating linguistic, numerical and gestural information. It has been argued that 'early' financial decision making involves fast processes akin to mental arithmetic [47] and visual enumeration [17]. The mood and emotive state of traders play an important role in financial trading, and the affect content of the stakeholders in the market appears to have an impact on the prices and volumes of the assets traded.


Bounded rationality appears to be a characteristic feature of almost any walk of human endeavour where an exchange takes place: money exchanged for goods and services, politicians writing manifestos pleading for votes [35] and voters' 'emotive' response [36], descriptions of 'other' people [2], and so on. In any of these exchange processes we see contrarian behaviour, the impact of framing, legacy effects and plain bias or prejudice. In such exchanges one has to take into account the mood and emotive state of the stakeholders. Thus we see that the advertising and marketing world is keen to quantify the affect content of the reaction of the customer, voter or tourist to the goods and services on offer [44]. The metaphorical use of language plays an important role in how we articulate our subjective experience, especially our moods and emotions. An integrated study of how metaphors and emotive language, indicating the strength of emotion together with its polarity and action-orientation, are used in describing subjective experience on the one hand, and of the relationship (if any) of the experience with observable facts on the other, is critical for the development of affective computing and sentiment analysis [3].

2   Exuberant Markets and the Impact of News

'Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations' [49]. Keynes suggested that there are "animal spirits" in the market [32], and the contrariness and obfuscation of traders and regulators is caused by these spirits, leading to irrational exuberance [39]. The impact of news on financial markets can be substantial and, according to the Nobel Laureate Robert Engle [23], asymmetric: the arrival of 'bad news' has a longer lasting effect on prices, and particularly on volumes of shares traded, than the arrival of 'good news'. Economists typically use proxies for the news content, such as changes in the values of currencies, bonds, or aggregated indices like the Dow Jones or NASDAQ. News was traditionally interpreted as a causal variable in financial market models, with readily quantifiable aspects of news serving as a proxy for the news itself, including the timing of news arrival, the volume of news items and the type of news [14]. News proxies were used with some effect to show that 'negative' news has a much longer lasting impact than positive news ([24][23]). This kind of sentiment analysis is almost always conducted post hoc ([15][18]), but there are some exceptions where researchers have attempted to predict the market [33]. News content analysis, rather than the use of proxies, has recently become more important. Continual records of market analysts' opinions on commercial news channels show that positive news has a short-term (1 minute) impact on prices, whereas negative news has an impact for around 15 minutes [16]. It has been argued that the use of optimistic language in a firm's press releases appears to increase the firm's future earnings, whilst pessimistic news has the opposite effect [20]. Tetlock has shown, by analyzing opinion columns in financial newspapers, that the negative affective component of news reports does have a longer lasting impact ([46][45]).


3   Dictionaries of Affect

One of the pioneers of political theory and communications in the early 20th century, Harold Lasswell [34], used sentiment to convey the idea of an attitude permeated by feeling, rather than the undirected feeling itself. This approach to analyzing the contents of political and economic documents, called content analysis, was given considerable fillip in the 1950s and 1960s by Philip Stone of Harvard University, who created the so-called General Inquirer System [43] and a large digitized dictionary, the GI Dictionary, also known as the Harvard Dictionary of Affect. There are two conventional methods of creating affect dictionaries. The first is to use a dictionary like Harvard University's General Inquirer (GI) lexicon [43] and to rely on the judgment of the dictionary makers. We have looked at the structure and content of the GI dictionary and compared it with other dictionaries of affect, including derivatives of WordNet: we found that whilst the major dictionaries of affect share less than 50% of their vocabulary, the sentiment computation is only marginally affected by the choice of dictionary ([22], [21]). Second, affect dictionaries can be created from a collection of texts: the so-called local grammar approach, pioneered by Zellig Harris [28] and his associates, especially Maurice Gross [27], was adopted by us in the development of a sentiment extraction system. The system extracted statistically significant collocational patterns of candidate terms in a corpus of specialist texts [6]; these patterns invariably comprise metaphorical words relating to change and affect ([4], [12]) in close proximity to the names of key objects, e.g. shares and currencies, and the names of key stakeholders and enterprises [48]. The two methods are complementary, and the corpus-based method can be used adaptively in order to cope with new words and a changing sentiment lexicon. Machine learning algorithms are increasingly being used to identify sentiment-bearing patterns automatically: typically, supervised learning algorithms are used, with a training corpus of texts comprising pre-tagged sentiment phrases and an external or exogenous variable, for instance the prices of financial assets; the algorithm learns to associate sentiment-bearing phrases with changes in the direction of prices ([33],[25]).
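As a rough illustration of the corpus-based route, the sketch below collects the words that co-occur with a key term (here "shares") inside a small window and keeps those that are over-represented there. The toy corpus, the window size and the simple frequency-ratio filter are assumptions made for the example; they stand in for, and are much cruder than, the collocation statistics used in the actual system.

```python
# Window-based collocates of a target term, filtered by a crude over-representation ratio.
from collections import Counter

def window_collocates(tokens, target, window=3):
    hits = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    hits[tokens[j]] += 1
    return hits

def significant_collocates(tokens, target, window=3, min_ratio=3.0):
    totals = Counter(tokens)
    near = window_collocates(tokens, target, window)
    n_near = sum(near.values()) or 1
    n_total = len(tokens)
    return {w: c for w, c in near.items()
            if (c / n_near) / (totals[w] / n_total) >= min_ratio}

corpus = ("bank shares tumbled on fears of losses while oil shares tumbled too "
          "the bank said the outlook for the economy was stable and the regulator "
          "welcomed the bank statement on lending").split()

# Keeps e.g. "tumbled" and "fears"; frequent background words like "the" and "bank" are filtered out.
print(significant_collocates(corpus, "shares"))
```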

4   Affect Analysis Systems

Typical affect analysis systems, especially those offered commercially by major news vendors, effectively manage a document stream. Each document in the stream is subjected to three different types of information extraction, which lead to the computation of a sentiment score, usually related to various statistics based on the distribution of the polarity of sentiment-bearing phrases in the text (see [1], [38] for details). First, meta-level or document-external information is extracted, which may comprise the time of publication, authorship attribution data, the length of the document, and originating and target sources. Second, information is extracted at at least two levels of linguistic description: (i) the lexical level, where sentiment-bearing words and phrases are extracted by using an affect dictionary, and (ii) the syntactic level, where the syntactic categories of words and phrases in the document are examined, primarily for disambiguation purposes. The third phase of information extraction relates to the authoritativeness of the document, using page ranking and similarity measures. Finally, the system outputs a sentiment score, which may be a count of sentiment words and phrases or a weighted average of such counts (see Fig. 1 below).

[Figure 1 summarises the affect analysis pipeline: each document undergoes meta-level information extraction (origin of document, authorship attributes, time of publication, verbiage statistics such as token frequency and word counts); linguistic information extraction, both "plain hits" (lexical matching to identify sentiment-bearing words/phrases, word sense disambiguation) and "fancy hits" (page rank, similarity measures); and finally the computation of a document sentiment score (semantic orientation, multivariate analysis).]

Fig.1. Affect Analysis Process
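To make the stages of Fig. 1 concrete, here is a toy version of the scoring step: a little meta-level extraction, lexicon matching, and a document score computed as the normalised difference of positive and negative hits. The six-word lexicon, the field names and the scoring rule are placeholders invented for the example, not the lexicon or weighting of any commercial system.

```python
# Toy document-stream scoring: meta-level extraction + lexical matching + score in [-1, 1].
POLARITY = {"gain": 1, "rally": 1, "profit": 1,
            "loss": -1, "tumble": -1, "fear": -1}

def meta_info(doc: dict) -> dict:
    text = doc["text"]
    return {"source": doc.get("source", "unknown"),
            "published": doc.get("published"),
            "tokens": len(text.split())}

def sentiment_score(text: str) -> float:
    hits = [POLARITY[t] for t in text.lower().split() if t in POLARITY]
    if not hits:
        return 0.0
    pos = sum(1 for h in hits if h > 0)
    neg = sum(1 for h in hits if h < 0)
    return (pos - neg) / (pos + neg)

doc = {"source": "newswire", "published": "2011-06-21",
       "text": "Bank shares rally as profit beats forecasts despite loss warnings"}
print(meta_info(doc), sentiment_score(doc["text"]))   # score = (2 - 1) / 3
```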

5   My contributions to affect analysis

I have been working on sentiment analysis since 1997 and have focused on developing methods and techniques that help in the extraction of affect (sentiment is a form of affect) from collections of texts. The methods focus on the relationship between domain-specific terms and affect words. The initial area of application was financial markets: the integration of a sentiment index, extracted from full-text newswires in English, with stock exchange indices was reported by us ([10], [11]). Subsequently, the use of computational grids and cloud computing was demonstrated and evaluated for the purposes of automatic extraction of sentiment words and technical terms from a corpus of financial news (c. 800,000 newswires) ([7], [4], [26]). The method developed by my colleagues was demonstrated to work on financial newswires in Arabic [12], Chinese ([9], [29]) and Urdu [13]. The work now embraces informal texts (blogs, for example) [37] and news reports about ideologically motivated groups [2]. Working in collaboration with the Trinity Business School, we were able to adapt the econometric notions of return (the rate of change of the price or volume of an asset) and volatility (the variance of the return) for financial and ideological sentiment analysis [5]. Results have recently been published about the nature of the distribution of sentiment returns in the German [37], North American, Japanese and Irish markets [19]. I am currently studying how metaphorical language is conceived and used, how such complex psycho-linguistic constructs can be represented in computer systems and in different languages, and how affect analysis is practised in fields as diverse as financial trading, brand management and product reviews. This study may help in formulating research questions in the critical area of how we humans are stimulated by and respond to subjective judgments. Papers on this broad topic of affective computing have recently been published as a collection comprising contributions from scholars of metaphor and experimental psychology, artificial intelligence, computational linguistics and information extraction, translation studies, brand management and financial trading [3].
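As a sketch of how the return and volatility notions mentioned above carry over to a sentiment series (the exact definitions used in [5] may differ), the snippet below treats a daily sentiment index like an asset price: log-returns are taken between consecutive days and volatility is estimated as a rolling standard deviation of those returns. The index values are invented.

```python
# Sentiment "returns" and rolling volatility for an invented daily sentiment index.
import numpy as np

def sentiment_returns(index: np.ndarray) -> np.ndarray:
    return np.diff(np.log(index))            # r_t = ln(S_t / S_{t-1})

def rolling_volatility(returns: np.ndarray, window: int = 5) -> np.ndarray:
    return np.array([returns[i - window:i].std(ddof=1)
                     for i in range(window, len(returns) + 1)])

daily_sentiment = np.array([0.52, 0.55, 0.47, 0.44, 0.50, 0.58, 0.61, 0.57])
r = sentiment_returns(daily_sentiment)
print(r)                                     # seven daily sentiment returns
print(rolling_volatility(r, window=5))       # volatility over 5-day windows
```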

References

1. Abbassi, A., Chen, H., & Salem, A. (2008). Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums. ACM Transactions on Information Systems, Vol. 26 (No. 3), Article 12.
2. Ahmad, K. (2008). Edderkoppspinn eller nettverk: News media and the use of polar words in emotive contexts. Synaps, 21, pp 20-36.
3. Ahmad, K. (Ed.) (2011). Affective Computing and Sentiment Analysis: Metaphors, Emotions and Terminology. Heidelberg: Springer Verlag, 200 pp.
4. Ahmad, K., Gillam, L. and Cheng, D. (2005). 'Textual and Quantitative Analysis: Towards a new, e-mediated Social Science'. Proc. of the 1st Int. Conf. on e-Social Science (Manchester, July 2005).
5. Ahmad, K. (2008). The 'return' and 'volatility' of sentiments: An attempt to quantify the behaviour of the markets? In Proc. of EMOT 2008: Sentiment Analysis: Emotion, Metaphor, Ontology and Terminology, Workshop at the 13th Language Resources and Evaluation Conf., 27 May 2008, Marrakesh, Morocco. (To appear in K. Ahmad (Ed.), 2011.)
6. Ahmad, K. Pragmatics of Specialist Terms and Terminology Management. In P. Steffens (Ed.), Machine Translation and the Lexicon, 3rd Int. EAMT Workshop, Heidelberg, Germany, April 26-28, 1993. Heidelberg: Springer, pp 51-76. (Lecture Notes in Artificial Intelligence, Vol. 898, Eds. J. G. Carbonell & J. Siekmann.)
7. Ahmad, K., Gillam, L. and Cheng, D. (2005). 'Society Grids'. In S. Cox and D. Walker (Eds.), Proc. UK e-Science All Hands Meeting 2005. Swindon: EPSRC, Sept 2005, pp 923-930.
8. Ahmad, K., Bale, T., & Casey, M. Connectionist Simulation of Quantification Skills. Connection Science, Vol. 14 (No. 3), pp 165-201.
9. Ahmad, K., Cheng, D. and Almas, Y. (2006). 'Multi-lingual Sentiment Analysis of Financial News Streams.' In S. Cozzini et al. (Eds.), Proc. 1st Int. Conf. on Grid in Finance (Palermo, February 2006). (http://pos.sissa.it/archive/conferences/026/001/GRID2006_001.pdf)
10. Ahmad, K., Cheng, D., Taskaya, T., Ahmad, S., Gillam, L., Manomaisupat, P., Traboulsi, H. and Hippisley, A. (2006). 'The mood of the (financial) markets: In a corpus of words and of pictures.' In A. Wilson, D. Archer and P. Rayson (Eds.), Corpus Linguistics Around the World. Amsterdam/New York: Rodopi, pp 18-32.
11. Ahmad, K., Taskaya Temizel, T., Cheng, D., Ahmad, S., Gillam, L., Manomaisupat, P., Traboulsi, H. and Casey, M. (2004). 'Fundamental Data to Satisfy the Chartist.' The Technical Analyst, Vol. 1(3), pp 36-39.
12. Almas, Y., & Ahmad, K. (2006). 'LoLo: A System based on Terminology for Multilingual Information Extraction'. In M. E. Califf et al. (Eds.), COLING-ACL 2006 Workshop on Information Extraction Beyond the Document, Sydney, Australia. Association for Computational Linguistics, pp 56-65. (http://acl.ldc.upenn.edu/W/W06/W06-0207.pdf)
13. Almas, Y., and Ahmad, K. (2007). A note on extracting 'sentiments' in financial news in English, Arabic & Urdu. The Second Workshop on Computational Approaches to Arabic Script-based Languages, Linguistic Society of America 2007 Linguistic Institute, Stanford University, Stanford, California, July 21-22, 2007. Linguistic Society of America, pp 1-12.
14. Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. Journal of Finance, pp 1259-1294.
15. Bauwens, L., Omrane, W.B., and Giot, P. (2005). 'News Announcements, Market Activity and Volatility in the Euro/Dollar Foreign Exchange Market.' Journal of International Money and Finance, Vol. 24, pp 1108-1125.
16. Busse, J. and Green, T. Clifton (2002). Market efficiency in real-time. Journal of Financial Economics, Vol. 65, pp 415-437.
17. Casey, M. & Ahmad, K. 'A competitive neural model of small number detection'. Neural Networks, Vol. 19 (No. 10), pp 1475-1489.
18. Chang, Y., and Taylor, S.J. (2003). Information Arrivals and Intraday Exchange Rate Volatility. Journal of International Financial Markets, Institutions and Money, Vol. 13, pp 85-112.
19. Daly, N., Ahmad, K., and Kearney, C. (2009). Correlating market movements with consumer confidence and sentiments: A longitudinal study. Proc. of Text Mining Workshop, Leipzig, Germany.
20. Davis, A.K., Piger, J.M., Sedor, L.M. (2006). Beyond the numbers: An analysis of optimistic and pessimistic language in earnings press releases. Technical Report, Federal Reserve Bank of St Louis.
21. Devitt, Ann and Ahmad, Khurshid (2008). Sentiment Analysis and the Use of Extrinsic Datasets in Evaluation. Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco, 28-30 May 2008.
22. Devitt, Ann and Ahmad, Khurshid (2007). 'Sentiment Polarity Identification in Financial News: A Cohesion-based Approach'. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp 984-991.
23. Engle, R.F. (2003). Risk and Volatility: Econometric Models and Financial Practice. Nobel Prize in Economics documents 2003-4, Nobel Prize Committee. http://nobelprize.org/nobel_prizes/economics/laureates/2003/engle-lecture.pdf
24. Engle, R.F., & Ng, V. K. (1991). Measuring and Testing the Impact of News on Volatility. NBER Working Papers 3681, National Bureau of Economic Research, Inc.
25. Genereux, M., Poibeau, Thierry and Koppel, Moshe. Sentiment analysis using automatically labelled financial news. In K. Ahmad (Ed.) (2011).
26. Gillam, L., Ahmad, K., Dear, G. (2005). 'Grid-enabling Social Scientists: The FINGRID infrastructure.' Proc. of the 1st Int. Conf. on e-Social Science (Manchester, July 2005).
27. Gross, Maurice (1997). 'The construction of local grammars.' In E. Roche and Y. Schabes (Eds.), Finite-State Language Processing (Language, Speech, and Communication). Cambridge, Mass.: MIT Press, pp 329-354.
28. Harris, Zellig (1991). A Theory of Language and Information: A Mathematical Approach. Oxford: Clarendon Press.
29. Hippisley, A., Cheng, D., & Ahmad, K. (2005). Head-Modifier Principle in Chinese. Natural Language Engineering, Vol. 11 (2), pp 129-157.
30. Kahneman, D. & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, Vol. 47 (No. 2), March 1979, pp 263-292.
31. Kahneman, D. (2002). Maps of Bounded Rationality: The [2002] Sveriges Riksbank Prize [Lecture] in Economic Sciences. (http://nobelprize.org/nobel_prizes/economics/laureates/2002/kahnemann-lecture.pdf, visited 14 April 2009.)
32. Keynes, John Maynard (1936). The General Theory of Employment, Interest and Money. London: Macmillan (reprinted 2007).
33. Koppel, Moshe, and Shtrimberg, Itai (2004). 'Good News or Bad News? Let the Market Decide.' In AAAI Spring Symposium on Exploring Attitude and Affect in Text, Palo Alto, CA, pp 86-88.
34. Lasswell, Harold D. (1948). Power and Personality. London: Chapman & Hall.
35. Namenwirth, Zvi, and Lasswell, Harold D. (1970). The Changing Language of American Values: A Computer Study of Selected Party Platforms. Beverly Hills, Calif.: Sage Publications.
36. Neuman, W.R., Marcus, G.E., Crigler, A.N., and Mackuen, M. (Eds.) (2007). The Affect Effect: Dynamics of Emotion in Political Thinking and Behaviour. Chicago & London: The University of Chicago Press.
37. Remus, R., Ahmad, K., Heyer, G. (2009). Integrating Indicators: Data Mining [German] Financial News, Blogs, Market Movement and Consumer Surveys for Sentiment. Proc. of Text Mining Workshop, Leipzig, Germany.
38. Shanahan, J.G., Qu, Y., & Wiebe, J. (Eds.) (2006). Computing Attitude and Affect in Text: Theory and Applications. Dordrecht: Springer.
39. Shiller, R. (2005). Irrational Exuberance (2nd Edition). Princeton: Princeton University Press.
40. Simon, H. A. (1978). Rational Decision-Making in Business Organization: The [1978] Sveriges Riksbank Prize [Lecture] in Economic Sciences. (http://nobelprize.org/nobel_prizes/economics/laureates/1978/simon-lecture.pdf, visited 14 April 2009.)
41. Simon, H. A. (1955). A behavioral model of rational choice. Quarterly Journal of Economics, Vol. 69, pp 99-118.
42. Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, Vol. 63, pp 129-138.
43. Stone, Philip J., Dunphy, D.C., Smith, Marshall S., Ogilvie, D.M. and associates (1966). The General Inquirer: A Computer Approach to Content Analysis. Cambridge/London: The MIT Press.
44. Teichert, T., Heyer, Gerhard, Schontag, Katja and Mairif, Patrick. Co-Word Analysis for Assessing Consumer Associations: A Case Study in Market Research. In K. Ahmad (Ed.) (2011).
45. Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More Than Words: Quantifying Language to Measure Firms' Fundamentals. Journal of Finance, June 2008, Vol. 63 (No. 3), pp 1437-1467.
46. Tetlock, Paul C. (2007). Giving Content to Investor Sentiment: The Role of Media in the Stock Market. Journal of Finance, June 2007, Vol. 62 (No. 3), pp 1139-1168.
47. Thaler, Richard (2000). 'Mental Accounting Matters.' In D. Kahneman and A. Tversky (Eds.), Choices, Values, and Frames. Cambridge/New York: Cambridge University Press.
48. Traboulsi, Hayssam, & Ahmad, Khurshid (2007). Automatic Construction of a Thesaurus of Proper Names: A Local Grammar-based Approach. 8th INTEX/NooJ Workshop 2005, Formaliser les langues avec l'ordinateur: De INTEX a NooJ, Besancon (France), May 30 - June 1, 2005. Edited by Svetla Koeva, Denis Maurel & Max Silberztein, Presses universitaires de Franche-Comte, pp 261-280.
49. Wilson, T., Wiebe, J., & Hoffmann, P. (2005). 'Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis.' Proc. Conf. on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp 347-354.


Automatic Feedback and Support for Students and Tutors Using CSCL Chat Conversations

Traian Rebedea1, Mihai Dascalu1, Stefan Trausan-Matu1,2, Costin Chiru1

1 University "Politehnica" of Bucharest, Department of Computer Science,
313 Splaiul Independentei, 060042 Bucharest, ROMANIA

2 Romanian Academy Research Institute for Artificial Intelligence,
13 Calea 13 Septembrie, Bucharest, ROMANIA

{traian.rebedea, mihai.dascalu, stefan.trausan, costin.chiru}@cs.pub.ro

Abstract. Although online conversations have been widely used in various e-learning and blended learning scenarios, especially in the context of collaborative problem solving or debating, there are very few tools for supporting the learners engaged in these activities and their tutors by delivering automatic feedback based on the analysis of their conversations. This paper presents PolyCAFe, a tool designed to provide automatic feedback and support for better exploring the chat conversations of learners engaged in CSCL activities that employ instant messaging in medium and large-sized groups. The system uses natural language processing techniques, combined with latent semantic analysis, in order to discover implicit links among utterances, discussion threads and other conversational cues. In order to determine the utility and effectiveness of using PolyCAFe in a real educational setting, a validation experiment involving 35 students who used the system for their assignments is described, together with some of the validation results.

Keywords: Computer Supported Collaborative Learning, natural language processing, automatic chat assessment, dialogism, validation, experiment.

1   Introduction

It is widely accepted that learners often use chat conversations and discussion forums to discuss and solve problems in an informal context, often without the involvement of any teachers and tutors. One of the main reasons for this increase in using online discussions in learning is the wide spread and familiarity that the learners have in using these technologies “day by day” for any activities, including learning. However, with the rise of Computer Supported Collaborative Learning, the usage of chats and discussion forums in more formal education settings has been explored and advocated [1]. One of the main advantages of using these technologies in formal learning is that the students are encouraged to become active participants in a given discourse related to a subject or course. This activity corresponds to the socio-cultural paradigm of learning that considers that being an active participant in a specific discourse that is related to and that uses the concepts, topics and subjects of a given course or domain
is more important than the cognitivist paradigm of transferring knowledge from the teacher to the learners. Therefore, the students become participants in a given discourse, sometimes mediated by the tutors or teachers, using the technologies that are familiar to them. However, the increased usage of these technologies raised another important problem. Due to the fact that these online discussions, chats or forums, are usually used for groups of students (small or medium for chats and medium or large for forums) and because there is no impediment for more than one participant to hold the floor at a given moment in time (as compared to face-to-face conversations where only one participant is speaking at a given moment in time), these online discussions cannot be processed and analyzed using the same techniques as normal text or even face-to-face dialogues, which are usually analyzed using a two-interlocutors model [2]. Therefore a new theory of discourse has to be employed for analyzing these online discussions that takes into account and exploits the possibility of having multiple discussion threads that exist and evolve in parallel at a given moment in time. Starting from Bakhtin’s dialogism [3], the authors have proposed a new polyphonic theory for the analysis of multi-participant chat conversations [4, 5]. PolyCAFe is a system that uses Natural Language Processing (NLP), Latent Semantic Analysis (LSA) and Social Network Analysis (SNA) that was designed for the analysis of educational chat conversations and discussion forums in order to provide feedback and support to learners and tutors using these technologies for learning. It starts from the aforementioned polyphonic theory and model of discourse and determines implicit linkages between the utterances of an online discussion in order to determine the discussion threads and to build a conversation graph. Moreover, PolyCAFe is also used to measure and determine the areas of a conversation where the participants are engaged in a collaborative discourse, denoted by a good inter-animation of the discussion threads. Several other works [6, 7, 8] have been focused on analyzing online conversations of students, especially discussion forums, but neither presents a system that was especially designed for supporting the learners (most of them are for supporting the researcher or the analyst), nor do they use a special model of discourse adapted for this type of interactions. The rest of the paper is structured as follows: the next section provides an insight in the general architecture of the system, while section 3 offers a detailed presentation of the main processing steps in the analysis. Section 4 describes a validation and verification experiment performed in a formal educational context by using PolyCAFe for two chats assignments. The main quantitative results and outcomes of the validation are presented in section 5 and the paper ends with a SWOT analysis resulted from the validation experiment. 2

General Presentation of PolyCAFe

For an enhanced and holistic assessment of both participants and of utterances, a multilayered architecture has been used for the development of PolyCAFe, with each tier presented in figure 1 using results of the analysis from the tiers situated below. All the processing is thoroughly described in the remainder of the paper.


[Figure 1 shows PolyCAFe's layered architecture: a base tier comprising surface analysis, the NLP pipe, WordNet, a domain ontology and LSA; above it, advanced NLP and discourse analysis; then SNA, polyphony and collaboration; and, at the top, feedback and grading.]

Fig.1. PolyCAFe's Architecture

1) First layer with surface evaluation, basic NLP processing and ontologies The first step in analyzing the raw transcript of a conversation is the NLP pipe which covers spelling correction, stemming, tokenizing, Part of Speech Tagging and NP chunking. Next, surface analysis is performed, consisting of various metrics derived from Page’s essay grading techniques [9] and readability measures [10]. The semantic sub-layer defines concepts using both linguistic or domain specific ontologies (e.g. WordNet) and Latent Semantic Analysis (LSA). These two approaches form the basics for a semantic evaluation of a participant’s involvement and evolution during a conversation. In contrast with the surface analysis which is based only on quantitative measurements, ontologies and LSA enable a qualitative assessment of the overall discussion and of each participant. 2) Second layer centered on advanced NLP and discourse analysis For a deeper insight of the discourse, techniques as speech acts identification, lexical chains, adjacency pairs, co-references and discussion threads are used for identifying interactions among participants, therefore highlighting implicit links. In addition, each utterance is evaluated using the semantic vector space from LSA in order to determine its importance regarding the overall discourse. 3) The third layer focused on collaboration, social networks analysis and polyphony The polyphony sub-layer uses the interactions (both explicit and implicit ones) and advanced discourse structures to look for convergence / divergence and polyphonic inter-animation. Social network analysis takes into account the social graph induced by the participants and their corresponding interactions that were discovered in the previous layer. Collaboration also plays a central role in the discourse for highlighting the involvement and the knowledge of each participant. This is essential from the perspective of Computer Supported Collaborative Learning because a chat or forum

23

with high collaboration has greater impact and is more significant in contrast with others in which the discourse is linear and the utterances are not intertwined in a polyphonic manner. The final step in the analysis combines the results of all previous sub-layers, granting access to textual and graphical feedback; also a grade proposal for each participant of a chat or forum discussion is made available. 3 3.1

3 Implementation Details Specific to each Sub-layer

3.1 Surface Analysis

Two categories of factors are used for performing a detailed surface analysis: metrics derived from Page's essay grading techniques and readability measures. Page's idea was that computers could be used to grade student essays as effectively as human evaluators by using only simple, easily detectable and statistically driven attributes [9]. In the initial study, Page correlated two concepts: proxes (computer approximations of interest) with human trins (intrinsic variables used by human evaluators). Using only simple measures, the correlation with the manual grades was 0.71, which shows that this kind of automatic assessment is possible. Starting from Page's metrics for automatically grading essays, a series of factors with equal weights was identified for evaluating each participant at the surface level. All these factors were grouped into four categories: fluency (e.g. number of total characters, number of total words, number of different words, mean number of characters per utterance, number of utterances), spelling (e.g. percentage of misspelled words), diction (e.g. mean and standard deviation of word length) and utterance structure (e.g. number of utterances, mean utterance length in words, mean utterance length in characters).

Readability [10] represents the reading ease of a text and reflects the author's writing style. An easily readable text or discourse improves comprehension, retention, reading speed and reading persistence. Although chat discourse is different from normal texts, readability still offers a good perspective on the current level of knowledge, understanding or, in some cases, attitude. From a computational perspective, specific formulas are used to estimate the reading skill required, therefore providing the means to target an audience: the Flesch Reading Ease formula evaluates the difficulty of reading the current text on a 100-point scale; Gunning's Fog Index (FOG) indicates the number of years of formal education a reader of average intelligence needs to understand the text on the first reading; the Flesch-Kincaid Grade Level formula rates utterances on a U.S. grade-school level - a score of 10 means that the document can be understood by a 10th grader. From this level on, two perspectives are analyzed: a quantitative one centered on involvement, mostly assessed through social network analysis, and a qualitative one, with competency measured using LSA, ontologies and heuristics in the grading process.
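As a rough illustration of how such surface metrics can be computed, the sketch below implements the three readability formulas in Python; the syllable counter is a crude heuristic and the sentence splitting is simplistic, which are simplifying assumptions of this example rather than details taken from PolyCAFe.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of vowels (good enough for an illustration).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    flesch_reading_ease = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    gunning_fog = 0.4 * ((n_words / sentences) + 100.0 * (complex_words / n_words))
    flesch_kincaid_grade = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    return {
        "flesch_reading_ease": flesch_reading_ease,   # 0-100 scale, higher = easier
        "gunning_fog": gunning_fog,                   # years of formal education needed
        "flesch_kincaid_grade": flesch_kincaid_grade, # U.S. grade level
    }

print(readability("The interface should expose undo for every destructive action."))
```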


3.2 Tagged Latent Semantic Analysis

LSA is a vector-space semantic technique for analyzing relationships between a set of documents and the terms they contain, by indirectly correlating them through concepts. The transformation is obtained by a singular-value decomposition of the term-document matrix and a reduction of its dimensionality. Instead of using regular corpora made up of plain text, the system uses chats for the training process. Two preliminary steps are spell-checking and stop-word elimination; next, POS tagging is applied and verbs are stemmed in order to reduce the corresponding forms identified in chats. The final step is word tagging (in the case of verbs, stem tagging), because the same word with different POS tags has different contextual senses and therefore different semantic neighbors [11]. Before performing the SVD decomposition, Tf-Idf (term frequency - inverse document frequency) weighting is applied for improving the results. The value of k (the number of dimensions of the final projection) is 300, an empirically optimal value also supported by multiple sources [12]. LSA segmentation divides the chats first by participant and then by fixed non-overlapping windows; in the end, segments of similar size and with inner cohesion due to speaker consistency are obtained. The purpose of integrating LSA in the assessment process is to estimate the proximity between two words, between a word and the whole document, and the similarity between two different utterances, using the cosine measure.
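A minimal sketch of this kind of pipeline is shown below, using scikit-learn for tf-idf weighting and truncated SVD; the toy corpus, the small number of dimensions (k would be around 300 on a real corpus) and the omission of POS/stem tagging are simplifying assumptions, not PolyCAFe's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for chat segments (one segment per participant window).
segments = [
    "chat tools support fast synchronous collaboration",
    "wiki pages support asynchronous collaborative editing",
    "forums organize slower threaded discussions",
    "blogs publish individual opinions with comments",
]

# Tf-idf weighting, then SVD projection into a low-dimensional latent space.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(segments)
svd = TruncatedSVD(n_components=3, random_state=0)
lsa_vectors = svd.fit_transform(X)

# Cosine similarity of the first segment against all segments in the latent space.
print(cosine_similarity(lsa_vectors[:1], lsa_vectors))
```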

3.3 Utterance Assessment Process

In order to obtain a thorough evaluation of involvement and relevance, the first step is the actual evaluation of the utterances in the discussion. This process involves building the utterance graph that highlights the intertwining of utterances, determining each utterance's importance in a given context and assessing its on-topic relevance relative to the entire discussion. The grading process of each utterance has three distinct components - quantitative, qualitative and social - presented briefly in Table 1.

Table 1. The main components of the utterance assessment

Quantitative:
- NLP pipe for utterance processing (spellchecking, stemming, tokenizing, POS tagging)
- Number of characters for each stem and the corresponding number of occurrences as the base of the evaluation

Qualitative:
- Semantic similarity based on LSA for assessment
- Predefined topics used for measuring utterance completeness
- Thread evolution with regard to future impact and thread coherence
- Relevance of the utterance with regard to the overall discourse

Social:
- Social Network Analysis applied on the utterance graph (degree)


The quantitative perspective evaluates each utterance at the surface level. From this view alone, the assigned score considers the length in characters of each word remaining after stop-word elimination, spellchecking and stemming are applied. To reduce the impact of unnecessary repetitions used only for artificially enhancing the grade, we apply the logarithm function to the number of occurrences of each word. A further improvement involves applying the logarithm function to the previously obtained result, therefore considerably reducing the impact of oversized but not cognitively rich utterances.

A more interesting dimension is the qualitative perspective, which involves the use of LSA in determining four different components: thread coherence, future impact, relevance and completeness. The final qualitative mark is obtained by multiplying all these factors. Starting from the utterance graph, thread coherence for a given utterance represents the percentage of links starting from that utterance which share a similarity above a given threshold. In order to ensure inner cohesion and continuity within a thread, the similarity between any adjacent utterances must exceed the specified threshold (in our case 0.1). Future impact enriches thread coherence by quantifying the actual impact of the current utterance on all inter-linked utterances from all discussion threads that include it. It measures the information transfer from the current utterance to all future ones (explicitly or implicitly linked) by summing up all similarities above the previously defined threshold. In terms of Bakhtin's philosophy [3], this can be assimilated to voice inter-animation and echo attenuation, in the sense that multiple voices from the current utterance influence directly linked utterances (both explicitly and implicitly); future impact, the echo of a given voice, is estimated by measuring the similarity between the two linked utterances. Relevance expresses the importance of each utterance with regard to the overall discussion; it is measured by computing the similarity between the current utterance and the vector assigned to the entire chat, therefore determining the correlation with the overall discussion. Because each discussion has a predefined set of topics that had to be followed, and which should represent the focus concepts of the chat, completeness measures the coverage of these keywords in each utterance. In our implementation, completeness is obtained by evaluating the similarity between the utterance and the vector of the set of keywords specified by the tutor or teacher as important topics of the discussion.

The social dimension implies an evaluation from the perspective of social network analysis performed on the utterance graph. In the current implementation only two measures from graph theory are used (in-degree and out-degree), but other metrics specific to SNA (for example, betweenness) and minimal cuts will be considered. On the other hand, degree centrality from social network analysis is not very relevant, because all the links follow the flow of the conversation and therefore all edges are oriented in the same direction.
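The sketch below illustrates, under simplifying assumptions, how the four qualitative components could be combined for one utterance, given precomputed LSA vectors and an utterance graph; the 0.1 threshold comes from the text, while the data structures, helper names and toy vectors are invented for the example.

```python
import numpy as np

THRESHOLD = 0.1  # similarity threshold mentioned in the text

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def qualitative_score(u, successors, chat_vec, topics_vec, vectors):
    """u: utterance id; successors: ids linked from u; vectors: id -> LSA vector."""
    sims = [cos(vectors[u], vectors[v]) for v in successors]
    coherent = [s for s in sims if s > THRESHOLD]
    thread_coherence = len(coherent) / len(sims) if sims else 0.0
    future_impact = sum(coherent)                      # summed echo of the voice
    relevance = cos(vectors[u], chat_vec)              # similarity to the whole chat
    completeness = cos(vectors[u], topics_vec)         # coverage of the tutor's topics
    return thread_coherence * future_impact * relevance * completeness

# Tiny fabricated example with 3-dimensional stand-ins for LSA vectors.
vecs = {1: np.array([0.9, 0.1, 0.2]), 2: np.array([0.8, 0.2, 0.1]), 3: np.array([0.1, 0.9, 0.3])}
chat = np.array([0.7, 0.4, 0.2])
topics = np.array([0.6, 0.5, 0.1])
print(qualitative_score(1, [2, 3], chat, topics, vecs))
```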


3.4 Social Networks Analysis

Social network analysis plays a very important role in the current evaluation, by offering the means of linking the qualitative and the quantitative approaches and by providing an integrated view of both the involvement and the knowledge of all participants. Links between participants are based upon interactions identified using lower-level entities (both implicit and explicit links). In this approach, various metrics are computed in order to determine the most important participant in the discussion: degree (in-degree, out-degree), centrality (closeness centrality, graph centrality, eigenvalues) and user ranking based on the well-known Google PageRank algorithm. For combining the two perspectives (quantitative and qualitative), the SNA metrics are applied both on the number of exchanged utterances and on the sum of the corresponding scores of all exchanged utterances. All the metrics applied from the perspective of social network analysis are scaled across all participants, being relevant only in comparison with the other participants of the same discussion.
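A rough sketch of such participant-level metrics, using the networkx library on a fabricated interaction graph (the participants, edges and edge weights, standing in for utterance counts or summed scores, are invented for illustration):

```python
import networkx as nx

# Directed interaction graph: an edge u -> v means u replied to v,
# weighted by the number of exchanged utterances (or their summed scores).
g = nx.DiGraph()
g.add_weighted_edges_from([
    ("alice", "bob", 5), ("bob", "alice", 3),
    ("carol", "alice", 4), ("bob", "carol", 2), ("dave", "bob", 1),
])

in_deg = dict(g.in_degree(weight="weight"))
out_deg = dict(g.out_degree(weight="weight"))
closeness = nx.closeness_centrality(g)
pagerank = nx.pagerank(g, weight="weight")

for p in g.nodes:
    print(p, in_deg[p], out_deg[p], round(closeness[p], 2), round(pagerank[p], 2))
```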

3.5 Collaboration

All the previous metrics provide the basis for measuring collaboration in specific CSCL environments (in our case chats and forums). An integrated vision of collaboration takes into account the following:
• Social cohesion, measured relative to all the SNA factors analyzed;
• Quantitative collaboration, based on the number of interactions between different participants;
• Mark-based and gain-based collaboration, estimated from the similarity of all interactions between participants and their corresponding scores.

3.6 Polyphony

Bakhtin's work is particularly interesting because of the concepts it illustrates, which have a great impact on the semantic analysis of texts and, in our case, of discourse. A very important concept presented in [4], in tight correlation with the notion of dialogue, is polyphony. Polyphony can be defined in terms of multiple points of view and voices, and it is closely related to the musical concept with the same name. From Bakhtin's perspective, Dostoevsky's prose can be considered a true representation of polyphony, because each character can be considered an individual voice, distinct from the others. Moreover, Dostoevsky's work presents conflicting views, not just various angles on a single all-knowing and overwhelming vision, as with other novelists. In the current analysis, the utterance is the unit of analysis, and the intertwining of voices/different perspectives is of utmost importance in discourse analysis. LSA, social networks, implicit links based on repetitions, co-references, discussion threads and the overall grading system ensure a proper evaluation and highlighting of concepts and voices. Further details about the computational approach and the technologies employed by PolyCAFe can be found in [13] and [14].


4 Validation and Verification Experiments

PolyCAFe has been used in a real educational setting, in a validation experiment that took place at the University "Politehnica" of Bucharest. The details of this validation round are presented in Table 2.

Table 2. Overview of the validation experiment

Course unit: Human-Computer Interaction
Course unit domain: Computer Science
Undergraduate (UG) / postgraduate (PG) and year of study: UG, year 4
Traditional / work-based / distance learning: traditional learning; the chat conversations are a distance learning activity, facilitated by the university
Number of learners: experimental group: 25; control group: 10
Mean age of learners: experimental group: 21.9 years; control group: 22.7 years
Gender of learners: experimental group: 36% female, 64% male; control group: 67% female, 33% male
Number of tutors taking part in the main pilot: 6
Main pilot, age of tutors: under 30: 3; 30-40: 2; over 40: 1
Main pilot, gender of tutors: 100% male
Number of tutors taking part in the dissemination workshop: 19
Number of teaching managers: 1
Job titles of teaching managers: Head of the Computer Science Department

The learners were divided into groups of 5 students (5 experimental groups and 2 control groups) and were given two successive chat assignments related to Human-Computer Interaction, to be debated using the ConcertChat environment. The experimental groups were asked to use PolyCAFe to get feedback for each assignment, while the control groups did not use PolyCAFe for the first assignment. The use of PolyCAFe for the second assignment was not mandatory, so these learners had the option to use the system only if they considered it useful for their learning task. The two topics for the assignments were:


• "A debate about the best collaboration tool for the web: chat, blog, wiki, forums and Google Wave. Each student shall choose one of the 5 tools and shall present its advantages and the disadvantages of the other tools. Thus, you will act as a "sales person" for your tool and try to convince the others that you have the best offer (act as a marketer - http://www.thefreedictionary.com/marketer). You must also defend your product whenever possible and criticize the other products if needed."
• "You are in the board of decisions of a company that plans to use collaborative technologies for its activities. Each of you has studied the advantages and disadvantages of the following technologies that are considered by the company: chat, blog, wiki, forums and Google Wave. Engage into a collaborative discussion in order to decide for which activities it is indicated to use each technology. You should give the best advice for the technology that you support and convince the others to use it. The result of this discussion should be a plan of using these technologies in order to have the best outcomes for your company. You can also think of other useful technologies beside these ones, but do not insist on them."

The tutors had to define the assignments in PolyCAFe, together with the list of relevant concepts from the latent semantic space for each assignment. The tutors also provided manual feedback to each of the students involved in a chat conversation for the first assignment; thus, each tutor assessed one conversation without using PolyCAFe and one conversation using PolyCAFe. No manual feedback was provided for the second assignment, only the feedback offered by PolyCAFe. Each tutor used PolyCAFe to help him assess and provide final (manual) feedback for 2-3 chat conversations. This experiment has shown that tutors who use PolyCAFe are able to provide feedback for a chat conversation in less than 70% of the time required without using the system.

Moreover, two verification experiments were conducted: the first focused on determining the relative quality of the manual feedback provided by tutors with and without using PolyCAFe, and the second centered on determining the quality of the participant and utterance grading provided by PolyCAFe. In the first experiment, each chat conversation from the first assignment was provided with manual feedback from 4 tutors: 2 of them using PolyCAFe and 2 without using it. Afterwards, the tutors decided which manual feedback was better (i.e. the feedback informed with / without PolyCAFe) by using a set of common indicators: quality of feedback related to participation and collaboration, quality of feedback related to the content of the conversation, and coverage of the feedback. However, the increase in the quality and consistency of the feedback was only minimal when using PolyCAFe, which highlighted that the individual assessment and feedback style of each tutor was more predominant than the influence of PolyCAFe.

In the second verification experiment, each tutor who did not use PolyCAFe for giving manual feedback on a particular chat conversation, and each student, provided a ranking in order of merit of the participants in the chat conversation they attended, by considering (1) content, (2) collaboration and participation and (3) overall merit. These rankings produced by tutors and learners were then compared with the ones provided by PolyCAFe.


Thus, the tutors (6) and students (35) manually ranked the participants in each of the 7 chat conversations of the first assignment. For each chat conversation there were 5 rankings from the students, plus two from the tutors who did not use PolyCAFe for providing manual feedback. The average ranking for each participant in a conversation was then computed and compared to the one provided by PolyCAFe for content and social impact. The results of this experiment have been presented in [15] and show that the system's correlation with the tutor annotations is similar to the mean inter-rater correlation of the tutors, and better than the inter-rater correlation of the students.
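As an illustration of this kind of comparison, the snippet below computes rank correlations with SciPy on fabricated rankings; the choice of Spearman's coefficient and the numbers themselves are assumptions made for the example, not values or methods reported by the study.

```python
from scipy.stats import spearmanr

# Fabricated rankings of 5 participants in one conversation (1 = best).
tutor_avg   = [1, 2, 3, 4, 5]
student_avg = [2, 1, 3, 5, 4]
polycafe    = [1, 3, 2, 4, 5]

rho_tutor, _ = spearmanr(tutor_avg, polycafe)
rho_student, _ = spearmanr(student_avg, polycafe)
print(f"PolyCAFe vs tutors: {rho_tutor:.2f}, vs students: {rho_student:.2f}")
```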

5 Validation Results

After the validation experiment was over, all the students from the experimental and the control groups were asked to fill in a 5-point Likert questionnaire. Moreover, some of the participants were randomly asked to take part in a focus group in order to determine the strengths and weaknesses of PolyCAFe. The analysis of the answers to the questionnaires supports the following statements:

1) Claim: Students obtain feedback within an appropriate time after they finish a chat / forum discussion.
Argumentation: Most students reported that feedback is delivered in time, responding to the item "I find that the feedback delivered by the system is quick and the analysis is not taking long" with Mean=4.33, SD=0.47, Agree/Strongly Agree = 100% (n=9). The average processing time per chat conversation is between 5 and 10 minutes.

2) Claim: The feedback and reports delivered by the system are considered useful by the students and tutors, and relevant by the teachers.
Argumentation: Most students considered that the feedback and reports are useful for improving their future activity. The tutors reported that the feedback and reports are useful for analyzing the activity of the students. The teacher also considered that the reports have a high degree of relevance.

3) Claim: Using PolyCAFe-mediated collaboration has the potential to improve the learning outcomes of the students.
Argumentation: Most students consider that the use of the system has improved their learning outcomes by providing them with information about what they did wrong in the first chat round. This has been validated by multiple items in the Likert questionnaire, the most important one being "Overall, I believe that PolyCAFe provides adequate support for my learning", with Mean=4.33, SD=0.47, Agree/Strongly Agree = 100% (n=9).

4) Claim: The visualization offers a better understanding* of the chat conversation.
(* Better understanding or insight means being able to extract more details about the chat in less time than using a usual text or chat-log style alternative, and to understand what went well or wrong in the conversation with regard to collaboration and content, in order to know what to improve.)
Argumentation: Most students and tutors reported that the conversation visualization offers them a better insight into the discussion. The responses to the item "The conversation visualization provides a simpler understanding of the


conversation (in order to understand the structure and collaboration in the conversation) compared to a simple text presentation of the chat" have shown Mean=4.44, SD=0.68, Agree/Strongly Agree = 89% (n=9).

5) Claim: The time needed for tutors to provide (1) final feedback and/or (2) grading is reduced.
Argumentation: All the tutors reported that the time needed to analyse a chat conversation in order to provide feedback and/or grading to the students was reduced. The productivity of the tutors who used PolyCAFe improved by at least 30% compared to that of the tutors who did not use it.

6) Claim: The feedback and grading offered by the system are trusted by the tutors.
Argumentation: Most tutors consider that the automatic feedback and grading offered by the system can be trusted. The feedback provided by PolyCAFe is not perfect, and several errors and misleading items were identified in the reported feedback, but these were considered minor.

7) Claim: The feedback and grading offered by the system are trusted by the learners.
Argumentation: Most learners consider that the automatic feedback and grading offered by the system can be trusted.

6 Conclusions

The key observations with regard to the PolyCAFe system are synthesized in the following SWOT analysis. The strengths of the system that would be positive indicators for adoption are:
• PolyCAFe promotes learner reflection on their performance as individuals and as members of a group;
• Feedback from PolyCAFe has been shown to improve the collaborative skills of learners in online discussions;
• The institution requires less tutor time for feedback, support and grading;
• PolyCAFe monitors the participation of the students and makes the learning processes more transparent, e.g. by locating the outliers and positioning individual learners in the peer group;
• PolyCAFe contributes to improving the consistency of feedback between tutors, especially if PolyCAFe is used as a starting point when providing the manual feedback to the learners;
• The feedback appears to have motivational aspects for engaging students in their activity.

The weaknesses of the system that would be negative indicators for adoption are:
• Poor usability and guidance when using the system cause discomfort to users;
• It is difficult to interpret the results, especially due to the high amount of information;
• The general trust in, and the reliability of, the system should be improved by improving its accuracy.


The system has further potential, especially taking into consideration the opportunities for its use:
• It is the only software on the market that provides complex feedback for online conversations with a focus on stimulating collaboration;
• While the use of CSCL conversations is becoming more and more popular to relieve the tutor burden, the PolyCAFe software enables the tutors to monitor the individual contributions;
• The system might be used as a feedback standard for training tutors to assess collaborative activities;
• There are many situations in which learners participating in online discussions do not receive any feedback for their contributions; the automatic feedback provided by PolyCAFe would therefore be very valuable;
• PolyCAFe could also be used as a starting point for a chat agent that offers live feedback to students involved in chats and forums, in order to motivate them or make them engage in a better collaborative discourse;
• The use of Web2.0 in education often produces many textual outputs that are very small portions of text, similar to a chat conversation; PolyCAFe could be easily adapted for this task.

However, the following elements represent risks and threats with regard to the presented system:
• PolyCAFe may not be suitable for all chat situations. It is more suited to chats where (1) grading and/or detailed feedback is required, and (2) one of the aims of the chat/forum is for social learning to take place;
• There may be change management issues in introducing automatic feedback / assessment systems into new environments. Introduction of the system has to overcome concerns about changes in working practices and whether PolyCAFe's output can be trusted;
• Some learners may feel uncomfortable being monitored during their collaboration;
• It may be difficult to integrate PolyCAFe with the IT architecture of the institution;
• Institutions may be deterred from adopting PolyCAFe owing to the extent of the initial training needed for interpreting the data;
• Privacy issues might arise in some institutions when using the system to analyse the contributions of students.

Acknowledgements. The research presented in this paper was partially supported by the FP7 EU STREP project LTfLL and the national CNCSIS grant K-Teams. The authors would like to thank all the students who were part of the validation experiments.


References

1. Stahl, G.: Group cognition: Computer support for building collaborative knowledge. Cambridge, MA: MIT Press (2006)
2. Manning, C., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass (1999)
3. Bakhtin, M.: Problems of Dostoevsky's Poetics. Edited and translated by Caryl Emerson. Minneapolis: University of Minnesota Press (1984)
4. Trausan-Matu, S., Rebedea, T.: Polyphonic Inter-Animation of Voices in VMT. In: Stahl, G. (Ed.), Studying Virtual Math Teams, pp. 451-473, Boston, MA, Springer US (2009)
5. Trausan-Matu, S., Rebedea, T.: A Polyphonic Model and System for Inter-animation Analysis in Chat Conversations with Multiple Participants. In: A. Gelbukh (Ed.), CICLing 2010, LNCS 6008, pp. 354-363, Springer (2010)
6. Dong, A.: Concept formation as knowledge accumulation: A computational linguistics study. Artif. Intell. Eng. Des. Anal. Manuf. 20, 1, 35-53 (2006)
7. Kontostathis, A., Edwards, L., Bayzick, J., McGhee, I., Leatherman, A., Moore, K.: Comparison of Rule-based to Human Analysis of Chat Logs. In: 1st International Workshop on Mining Social Media Programme, Conferencia de la Asociación Española para la Inteligencia Artificial (2009)
8. Rose, C. P., Wang, Y.C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., Fischer, F.: Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning. International Journal of Computer Supported Collaborative Learning (2007)
9. Page, E. B., Paulus, D. H.: Analysis of essays by computer. Predicting Overall Quality, U.S. Department of Health, Education and Welfare (1968)
10. http://www.streetdirectory.com/travel_guide/15672/writing/all_about_readability_formulas__and_why_writers_need_to_use_them.html
11. Wiemer-Hastings, P., Zipitria, I.: Rules for Syntax, Vectors for Semantics. University of Edinburgh, Scotland
12. Lemaire, B.: Limites de la lemmatisation pour l'extraction de significations. JADT 2009: 9es Journées internationales d'Analyse statistique des Données Textuelles (2009)
13. Dascalu, M., Rebedea, T., Trausan-Matu, S.: A deep insight in chat analysis: Collaboration, evolution and evaluation, summarization and search. In: Proceedings of AIMSA 2010 (2010)
14. Rebedea, T., Dascalu, M., Trausan-Matu, S.: Overview and preliminary results of using PolyCAFe for collaboration analysis and feedback generation. In: Proceedings of EC-TEL 2010 (2010)
15. Dascalu, M., Rebedea, T., Trausan-Matu, S.: PolyCAFe: Collaboration and Utterance Assessment for Online CSCL Conversations. Accepted paper at CSCL 2011 (2011)


Tags Filtering Based on User Profile Saida Kichou1, Youssef Amghar2, Hakima Mellah1 1

Research Center on scientific and technical information (CERIST) Benaknoun, Algiers, Algeria {skichou, hmellah}@mail.cerist.dz 2 INSA of Lyon, Computing on Images and information Systems Laboratory. Lyon, France. [email protected]

Abstract. 'Collaborative tagging' is gaining popularity on Web 2.0, the new generation of the Web that turns the user into a reader/writer. Tagging is a means for users to express themselves freely through the addition of labels, called 'tags', to shared resources. One of the problems encountered in current tagging systems is defining the most appropriate tags for a resource. Tags are typically listed in order of popularity, as in Delicious, but the popularity of a tag does not always reflect its importance and representativeness for the resource with which it is associated. Starting from the assumptions that the same tag for a resource can take different meanings for different users, and that a tag from a knowledgeable user is more important than a tag from a novice user, we propose an approach for weighting a resource's tags based on the user profile. For this, we define a user model to be integrated in the calculation of tag weights, and a weighting formula based on three factors: the degree of approximation between the user's interests and the domain of the resource, the user's expertise, and the user's personal confidence in the tags he associated with the resource. A resource descriptor containing the best tags is created. A first evaluation of this approach on an information retrieval system confirms a clear improvement compared to the sole use of popularity.

Keywords: Collaborative Tagging, User profile, Information retrieval

1 Introduction

Collaborative tagging has emerged in the social web (Web 2.0) as a support for the organization of shared resources, allowing users to categorize and find these resources. In this paper we study collaborative tagging systems which, like any other search environment, are not spared the problem of information accessibility. In these systems, a considerable number of users tag (annotate) shared resources, and a resource (text document, image, video) may be assigned several divergent tags. The tagging system establishes popular tags, which are usually displayed as a tag cloud or tag list. A resource is represented by its most popular tags, sorted in descending order, and the user expresses his information need through these tags. If the tags are not fairly representative of the resource with which they are associated, the user's need is therefore likely to remain unsatisfied.


Knowing that the popularity of a tag for a given resource is the number of times it is cited, a popular tag is not necessarily representative of its resource. According to [6], it is very common for a user to repeat tags already associated with the object, and this can make the repeated tag popular without it being really relevant to the content. A user's need may remain unsatisfied [16]: there can be noise (the return of an irrelevant resource because of the association of unrepresentative tags) or silence (the omission of a relevant resource because of the non-popularity of representative and important tags). We therefore believe that popularity calculated in this way is not sufficient to claim that a popular tag is representative of a given resource, and other criteria are needed to determine the adequacy of a tag. We integrate the notion of user profile, assuming that the same tag for a resource can take different meanings for different users, and that a tag from a knowledgeable user is more important than a tag from a novice one. Thus, we propose a weighting of tags based on the preferences and activities of the user. The paper is organized as follows: Section 2 briefly presents collaborative tagging and reviews work on tag-based search and user profiles in collaborative tagging systems; we present our approach in Section 3 and finish with a conclusion.

2 Collaborative Tagging and Tag-based Search

Collaborative tagging denotes the process of freely associating one or more "tags" with a resource (web page, photo, video, blog ...) by a set of users. The term tagging is often associated with folksonomy, which refers to a classification (taxonomy) made by users (folks) [10], [15], and is defined by [2] as a set of metadata created by a collective of users to categorize and retrieve online resources. The main action in collaborative tagging systems is of course the association of tags with resources by users; this enables search and exploration of content, which we call tag-based search.

2.1 Collaborative Tagging

Many tagging systems are present on the web, such as Delicious for web pages, Flickr for images, YouTube for videos, Technorati for blogs and CiteULike for scientific papers. In a tagging model, links can be found between resources (such as links between web pages) and between users (the social network), as can be seen in the conceptual model of [11].

2.2 Tag-based Search and the User Profile

Several studies aim to improve tag-based search. For example, starting from the fact that exploration based solely on popularity is limited, [16] suggests offering the user specific tag clouds and proposes to incorporate the user profile to rank


resources by their degree of relevance, based on a probabilistic model, in the search process. The model of collaborative tagging is personalized by suggesting tags to a particular user annotating a particular resource, depending on the user's preferences. Other works attempt to exploit the user profile in different ways, such as [18], which implements the vector model of information retrieval and introduces a user's tags as his interest vector. It is also the case of [5], which considers that tags are a new type of user feedback and can be a very important indicator of the user's preferences. Tagging actions provide information that can be used to improve the system's knowledge about the user. Different approaches to profile construction based on tagging are presented in [6], which also offers a new approach based on the creation of a graph of the tags associated with a user, taking into account the age of each tag. To create a specific and dynamic tag-based user profile, [9] introduces the concept of the capacity of a tag to represent a resource, based on two factors: the order of tagging and popularity. According to [8], the first tag given by a user for a resource is more representative than the following ones. While [9] uses the tags to build the user profile, we build, in our case, the profile with the aim of creating resource descriptors composed of the tags considered most representative. In the remainder of this paper, we propose a new approach for weighting tags based on the user profile.

3 A tag filtering approach based on the user profile

We present a new approach which aims to build a set of tags describing resources in the most precise and accurate possible manner. This descriptor will be used for search or classification purposes. The idea is to find a criterion for ranking tags other than popularity: the approach integrates the user into the calculation of the weight of a tag associated with a given resource. For this, we first present a model of the user profile containing personal information, activity and expertise, and a method for constructing this profile based on tagging. Then, we propose a tag weighting formula incorporating the user profile. The weighted tags thus obtained are ranked in descending order of weight, and the first n of them form the descriptor of the resource.

3.1 The model of the user profile

To integrate the user profile into the calculation of tag weights, we define a user model representing information that reflects the user's activity in the system. We describe the adopted user model and then explain the process of building it.

3.1.1 Representation of the profile

The user profile is a structure of heterogeneous information which covers broad aspects such as the cognitive, social and professional environment of users [14]. This heterogeneity is often represented by a multidimensional structure. Eight dimensions


are defined in the literature for the user profile [1], [3]: personal data, interests, expected quality, customization, domain ontology, preference feedback, security and privacy, and other information. Defining the profile of a particular user for a given application amounts to selecting the dimensions considered useful [3]. In our work, a user is defined by three dimensions: the first contains personal information, the second represents his interests and the third holds information on his degree of expertise in the domain.
• The personal dimension is used to identify the user (username, name, login, password ...). This information is entered by the user.
• The interests dimension: a resource whose context is close to the area of the user's interest is very likely to be tagged more accurately. The interests dimension Int(ui) tells us about the interests and preferences of the user. It is represented as a vector of weighted tags, constructed using a combination of two profile-building approaches: the naive approach [6] and the co-occurrence approach derived from social network analysis techniques [17]. Int(ui) = {(t1, w1), (t2, w2), ..., (tj, wj)}.
• The expertise dimension: expert users in a given domain tend to use specific terms when tagging, since they have a perfect mastery of the concepts of this domain. This dimension is the degree of mastery of the user in the domain of the tagged resources. It depends on the levels of the tags in the domain ontology used for this purpose. The greater the expertise, the closer the user is to the context of the resource.

3.1.2 Construction of the profile

The construction of the user profile comes down to building the dimensions Int(u) and Exp(u) based on the tagging operations the user performs.

3.1.2.1 Construction of the interests dimension

The two approaches most commonly used for constructing users' interests from their tags are the naive approach and the co-occurrence approach [6]. The naive approach builds the profile from the top tags given by the user when tagging all his resources, in order of popularity (see the example in Figure 2). Its simplicity and speed of implementation make it a widely used approach, especially in the form of tag clouds, but the resulting tags are usually generic terms, selected at the expense of specific ones. The co-occurrence approach consists of creating a graph whose nodes represent the tags cited by the user and whose edges are the co-occurrence relations between these tags; the arcs are weighted by the number of co-occurrences. The resulting profile is formed by the top k nodes participating in the arcs with the greatest weights. This approach is widely used for detecting relationships between tags [4], extracting light ontologies [12] and recommending tags [19]. It helps to build a more accurate profile than the one obtained with the naive approach; however, it has the disadvantage of neglecting resources with a single tag. In our work we propose to combine these two approaches: the combination not only eliminates this problem, but also allows weighting the tags, which is not possible with the co-occurrence approach alone. The result of


the combination is a graph with weighted nodes and arcs (Fig. 1). The nodes (tags) belonging to the arcs with the greatest weights form our interest vector.

Fig.1. Example of a graph constructed with the combination of the naive approach and the co-occurrence approach
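To make the combined construction concrete, here is a small sketch that builds the co-occurrence graph from a user's tagging history and keeps the tags participating in the heaviest arcs, weighted by their naive popularity; the tagging history and the exact combination rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

# One list of tags per resource tagged by the user (fabricated history).
history = [
    ["programming", "python", "tutorial"],
    ["python", "web", "html"],
    ["html", "css", "design"],
    ["python"],                      # single-tag resource, still counted by the naive part
    ["programming", "python"],
]

popularity = Counter(tag for tags in history for tag in tags)      # naive weights
cooccurrence = Counter()                                           # arc weights
for tags in history:
    for a, b in combinations(sorted(set(tags)), 2):
        cooccurrence[(a, b)] += 1

# Tags participating in the heaviest arcs, weighted by their naive popularity.
k = 5
top_arcs = [arc for arc, _ in cooccurrence.most_common(k)]
interest_tags = {t for arc in top_arcs for t in arc}
interest_vector = {t: popularity[t] for t in interest_tags}
print(interest_vector)
```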

Figure 1 shows the graph of a user profile, with the co-occurrence weights on the arcs and the popularity of each tag. The figure below shows the interest vectors of this user obtained with the naive approach, the co-occurrence approach and the hybrid approach, with k = 5.

Fig. 2. Comparison of vectors constructed with different approaches

3.1.2.2 Construction of the expertise dimension

An expert user in a domain has a perfect mastery of the specific terms of this domain. He therefore tends to associate these terms with the resources he tags (e.g. in pharmacy, an expert associates the name of a drug molecule, whereas a novice just associates the term 'medicine'). In our approach, we use a domain ontology: we take the tags the user has associated with all his resources and locate their levels (depths) in the hierarchy of the ontology. The deeper the tags used, the more expert the user. Expertise is the average depth of the tags of the user, calculated as follows:


Exp(u_i) = \frac{1}{|T_u|} \sum_{t_j \in T_u} Prof(t_j)    [1]

where Prof(t_j), the depth of tag t_j, is the number of nodes separating it from the root, and T_u is the subset of the user's personomy containing the tags he has associated with resources, defined as T_u = {t_j | (u_i, t_j, r) ∈ Y}, with Y the set of annotations (the tagging actions). According to Table 1 below, which shows the depths of the tags of Figure 2, the expertise of the user is 7.9.

Table 1. Example of calculation of user expertise

Tag          Depth    Tag        Depth    Tag       Depth
Programming  9        reference  7        Tutorial  9
Python       12       Free       5        Web       3
html         9        Maps       7        Rails     8
Css          9        Data       5        Ajax      9
Design       9        Ruby       9        Gis       8
Geocode      7        Books      9        Video     5
Javascript   8        Google     11
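Since the experiments reported later in the paper use WordNet to obtain tag depths, a minimal sketch of this computation could look as follows; mapping a tag to its first noun synset and using min_depth() are simplifying assumptions of the example.

```python
# Requires: nltk and a downloaded WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def tag_depth(tag: str) -> int:
    """Depth of the first noun synset of the tag in the WordNet hierarchy (0 if unknown)."""
    synsets = wn.synsets(tag, pos=wn.NOUN)
    return synsets[0].min_depth() if synsets else 0

def expertise(user_tags: list) -> float:
    """Average depth of the user's tags that are found in WordNet."""
    depths = [tag_depth(t) for t in user_tags if wn.synsets(t, pos=wn.NOUN)]
    return sum(depths) / len(depths) if depths else 0.0

print(expertise(["programming", "python", "web", "html", "design"]))
```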

3.2 Weighting tags based on the user profile

The weight of a tag is calculated according to the user who issued it: the same tag will be assigned two different weights if the two users are different. On the other hand, for the same user, the tags associated with a resource should have different weights. We define the weight of a tag depending on the user profile, represented by its two dimensions, interests and expertise. In order to introduce the subjective aspect of the tag, we also include the user's own feedback via a rating. The weight of a tag is calculated as follows:

[2]

where dist(Int(ui), Pop(r)) represents the degree of approximation between the resource and the interests of the user. This is the distance between the interest vector Int(ui) and the vector of the resource, composed of its popular tags. This distance is calculated using the cosine formula as follows:


dist(Int(u_i), Pop(r)) = \frac{Int(u_i) \cdot Pop(r)}{\|Int(u_i)\| \, \|Pop(r)\|}    [3]
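A small sketch of this cosine computation over sparse tag vectors, with fabricated interest and popularity vectors; only the cosine measure itself is taken from the text.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine between two sparse tag vectors given as {tag: weight} dicts."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

interest = {"python": 4, "web": 3, "html": 2}                 # Int(u), fabricated
resource = {"design": 3, "css": 2, "html": 4, "tools": 4}     # popular tags of r
print(round(cosine(interest, resource), 2))
```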

Conf(u, r) represents the degree of trust (or confidence) of the user in his own tag. It is obtained via a rating, from one (1) to five (5), given every time he tags a resource, and is calculated as follows:

Conf(u, r) = rating(u, r) / 5    [4]

The expertise of a user is calculated over the whole domain; by dividing it by the distance between the user and the resource vector, we seek the contribution of the resource to the expertise of the user. The closer the resource is to the user, the smaller this distance, and therefore the higher the ratio: a user who tags a resource close to his interests, and who thus confirms his expertise, gives the tag a heavy weight, whereas a resource that diverges from his interests should not receive a great weight in the name of the user's expertise in the domain. The degree of trust of the user in the tag associated with the resource is used as a kind of weight regulator: if the user is not at all sure of his tag, he assigns a rating of 0 and the calculated weight becomes a simple calculation of popularity, while if he assigns the maximum score, his profile is fully used in the weight of the tag. It is therefore the degree to which the user profile is introduced into the calculation of the weight of the tag.

In the following example, we show the tags associated by four users with the URL http://blogs.msdn.com/jensenh/default.aspx, extracted from Delicious, whose vector of popular tags is {design: 3, css: 2, html: 4, tools: 4}. Figure 3 shows the interests of the four users, calculated in the same manner as in the example of Figure 2. Table 2 summarizes the tags associated with this resource by the four users, together with their popularity and the weight calculated with our approach. The distances, expertise and confidence values of the users are given in Table 3.

Fig. 3: Interests vectors of the four users


Table 2. List of tags associated with the resource

Tag            Popularity   Weight
html           4            24.31
tools          4            24.31
design         3            9.85
css            2            16.55
video          1            2.09
go             1            5.5
maps           1            2.26
presentation   1            2.26
programming    1            14.46

Table 3. Distances, expertise and confidence values of the users

User   dist(u,r)   Exp(u)   Conf(u,r)
u1     0.28        7.9      0.8
u2     0.64        5.4      0.8
u3     0.60        3.8      0.4
u4     0.64        2.5      0.6

The new vector of the resource is {html: 24.31, tools: 24.31, css: 16.55, programming: 14.46}. Comparing this vector with the one built from tag popularity alone, we note that taking the user profile into account in the weight calculation favors tags from expert users at the expense of the most popular tags when the latter are cited by users of arbitrary expertise. This is the case for the tag programming, whose popularity is lower than that of the tag design but which was cited by a more expert user. On the other hand, tags that have the same popularity are assigned different weights.

4 Experiments

To test our weighting scheme we built a tag-based information retrieval system. The goal is to see whether the result obtained using the new tag weights is better than the one obtained with popularity alone. For this, the results of the two searches (by popularity and by weight) are compared with the results of an information retrieval system that we developed for this purpose. We indexed the web pages corresponding to the URLs used and implemented an information retrieval system based on the vector space model, with the tf-idf formula [7] for weighting index terms; the similarity function is the scalar product between the query vector (in our case, popular tags) and the resource vector (the index of the web page). We conducted tests on a collection of 149 URLs extracted from Delicious, tagged by 6 users with different profiles using 215 tags, with an average of 43.66 URLs tagged per user. WordNet was used to calculate the tag depths.
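A compact sketch of such a reference retrieval system, using scikit-learn's tf-idf vectorizer and a dot-product (linear kernel) similarity; the page texts and the query are fabricated stand-ins, whereas the real system indexes the 149 crawled pages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Stand-ins for the indexed web page contents.
pages = {
    "url1": "python programming tutorial with web examples",
    "url2": "css and html design patterns for the web",
    "url3": "maps and geocoding services overview",
}

vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(pages.values())

# The query is a set of popular tags; similarity is the scalar product.
query = vectorizer.transform(["html design"])
scores = linear_kernel(query, index).ravel()
ranking = sorted(zip(pages.keys(), scores), key=lambda p: p[1], reverse=True)
print(ranking)
```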


4.1 Evaluation process

After indexing the 149 web pages, we first removed all tags and all index keywords that do not appear in WordNet. Then, for the k best (most popular) tags of the collection, we searched the index already built; the vector obtained is considered the ideal vector VI. A similar search is performed on the tags of the URLs based on tag popularity, and another based on the new weights calculated with our formula; we thus obtain two vectors, VP and VW respectively. Finally, we compared these two vectors to the ideal vector VI.

4.1.1 Construction of the ideal vector VI

For each of the top 50 most popular tags, we constructed an ideal vector as follows. The tag in question is used as the query for the search engine, which searches the index and retrieves the list of URLs corresponding to the tag. We consider the vector obtained from the index as ideal because it is obtained from the keywords constituting the content of the web page; it represents the content of the page better than tags, which are not necessarily representative.

4.1.2 Construction of the popularity vector VP

For a given tag, the system searches for the URLs that have been tagged with this tag and retrieves its popularity. The resulting URLs are then ranked in descending order of popularity.

4.1.3 Construction of the weight vector VW

In the same way as for popularity, we construct the weight vector, but this time the resulting URLs are ranked not by popularity but by the weights calculated with our formula. The search results form the weight vector VW.

4.1.4 Comparison of VP and VW to VI

The purpose of the comparison is to see which of the two vectors, VP or VW, is the closest to the ideal vector VI. We calculated the cosine similarity between VI and VP, then between VI and VW. The values obtained with 10 tags are represented in Figure 4. Recall that the greater this value, the more similar the vectors.


Fig. 4. Comparison curve of the vectors obtained with 10 tags

4.2 Results and discussion

Several tests were performed on the collection by varying the number of tags on which to perform the search (the parameter k mentioned above). From the various results, we observed that the results of the two searches (by popularity and by weight) converge beyond a certain number of tags (usually 10% of the total number of tags). This is because tags ranked low on the list have low popularity and thus a low weight, since the number of users decreases with popularity. The curve below is obtained with 10% of the tags.

Fig. 5. Comparison between the search based on popularity and the one based on the new tag weights

5 Conclusion

The freedom users have in choosing tags causes many problems, among others the assignment of unrepresentative tags. Tags are ranked by popularity; however, a popular tag is not necessarily representative of the content with which it is associated.


In this paper, we proposed an approach for weighting tags based on the user profile, with the aim of creating a descriptor that is fairly representative of the resource content. We therefore defined a model of the user profile with three dimensions - personal information, interests and expertise - and an approach for building the profile that is a hybrid of the naive and co-occurrence approaches. The weight of a tag is calculated based on three factors: the distance between the constructed interest vector and the resource vector composed of popular tags, the expertise of the user, and the trust factor that allows the user to evaluate himself with respect to the resource he tagged. To evaluate our approach we built a tag-based information retrieval system; the goal was to see whether the result obtained using the new tag weights is better than the one obtained with popularity alone, so we developed an information retrieval system implementing the vector space model. The index-based search is taken as the ideal outcome, against which we compared the popularity-based search and the new weight-based search. The assessments we conducted show a marked improvement in search results when using the new weights. However, it should be noted that the quality of the users plays an important role and that in some contexts the results may deteriorate; this is the negative side of the subjectivity introduced in the formula by the confidence given by the user. One possible perspective is the use of a domain ontology for a better definition of expertise: in our implementation we used WordNet to retrieve the tag depths, and applying the approach to a collection of resources from a given domain, using an ontology of that domain, could yield more interesting results.

References

1. Amato, G., Straccia, U.: User Profile Modeling and Applications to Digital Libraries. In: Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, Paris, France (1999)
2. Broudoux, E.: Folksonomie et indexation collaborative, rôle des réseaux sociaux dans la fabrique de l'information. Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland, May (2006)
3. Bouzeghoub, M., Kostadinov, D.: Personnalisation de l'information: aperçu de l'état de l'art et définition d'un modèle flexible de profils. In: Actes de la Conférence francophone en Recherche d'Information et Applications CORIA'2005, pp. 201-218 (2005)
4. Cattuto, C., Schmitz, C., Baldassarri, A., Servedio, V. D. P., Loreto, V., Hotho, A., Grahl, M., Stumme, G.: Network properties of folksonomies. AI Communications Journal, Special Issue on "Network Analysis in Natural Sciences and Engineering" (2007)
5. Carmagnola, F., Cena, F., Console, L., Cortassa, O., Gena, C., Goy, A., Torre, I.: Tag-based User Modeling for Social Multi-Device Adaptive Guides. Special issue on Personalizing Cultural Heritage Exploration (2008)
6. Cayzer, S., Michlmayr, E.: Adaptive user profiles. Book chapter in: Collaborative and Social Information Retrieval and Access, ISBN-13: 9781605663067 (2009)
7. Kowalski, G. J., Maybury, M. T.: Information Storage and Retrieval Systems - Theory and Implementation, Second Edition. Kluwer Academic Publishers (2002)

8. Golder, S. A., Huberman, B. A.: The Structure of Collaborative Tagging Systems. Journal of Information Science 32(2), 198-208, Aug (2005)
9. Huang, Y., Hung, C., Hsu, J.: You are what you tag. Association for the Advancement of Artificial Intelligence (www.aaai.org) (2008)
10. Mathes: Folksonomies - Cooperative Classification and Communication Through Shared Metadata. Internal report, GSLIS, Univ. Illinois Urbana-Champaign (2004)
11. Marlow, C., Naaman, M., Boyd, D., Davis, M.: Tagging, taxonomy, flickr, article, toread. In: Collaborative Web Tagging Workshop at WWW'06, Edinburgh, UK (2006)
12. Mika, P.: Ontologies are Us: a Unified Model of Social Networks and Semantics. In: ISWC, volume 3729 of LNCS, pp. 522-536, Springer (2005)
13. Rupert, M., Hassas, S.: Building Users' Profiles from Clustering Resources in Collaborative Tagging Systems. In: AMT'10, Proceedings of the 6th International Conference on Active Media Technology (2010)
14. Tamine-Lechani, L., Zemirli, N., Bahsoun, W.: Approche statistique pour la définition du profil d'un utilisateur de système de recherche d'informations. In: Actes de la Conférence francophone en Recherche d'Information et Applications (CORIA 2006), Lyon, France (2006)
15. Vanderwal, T.: Explaining and Showing Broad and Narrow Folksonomies. http://www.vanderwal.net/random/entrysel.php?blog=1635 (2005)
16. Wang, J., Clements, M., Yang, J., de Vries, A., Reinders, M. J. T.: Personalization of tagging systems. Information Processing & Management (2009)
17. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press, Cambridge, UK (1994)
18. Xu, S., Bao, S., Fei, B.: Exploring Folksonomy for Personalized Search. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008)
19. Xu, Z., Fu, Y., Mao, J., Su, D.: Towards the semantic web: Collaborative tag suggestions. In: WWW Workshop on Collaborative Web Tagging (2006)


Modeling with Fluid Qualities: Formalisation and Computational Complexity Matei Popovici1, Cristian Giumale1, Alexandru Agache1, Mihnea Muraru1, Lorina Negreanu1 1

POLITEHNICA University of Bucharest Computer Science Department

{pdmatei, cristian.giumale, alexandruag, mmihnea, lorina.negreanu}@gmail.com

Abstract. Traditionally, ontological approaches provide static descriptions of a modeled universe and have no inherent means for representing evolution. Ontology description languages such as OWL [1] have no dedicated structures for holding and reasoning about temporal properties of a domain. Also, temporal modeling methods such as Temporal Logic [2], temporal extensions of Description Logics [3] and the Event Calculus [4] are unable to fully characterise the evolution of a process. This work proposes a modeling method based on fluid qualities, aimed at representing time-dependent properties. We formulate a basic reasoning mechanism, action applicability, and discuss its computational complexity. We show that, in many cases, action applicability is tractable; however, in the worst case, the same problem can be NP-hard. The proposed method has wide applicability in areas such as pervasive and ubiquitous computing, as well as the Semantic Web.

Keywords: Semantic Web, ontologies, NP-hardness, computational complexity, temporal modeling.

1 Introduction

The representation of time has been a popular topic in Artificial Intelligence and the primary focus of formal methods such as Temporal Logic [2], temporal extensions of Description Logics [3] and the Event Calculus [4]. Each of these modeling methods deals with particular aspects related to time, but none of them focuses on representing the temporal evolution of a system or of a specific application domain. Languages for developing ontologies based on DL (Description Logics), such as the Web Ontology Language (OWL) [1], are designed to describe a certain domain-specific status quo, and capturing changes is often difficult in these settings. Take for instance OWL-Time [5]. It describes temporal concepts such as TemporalEntity, Instant and Interval using existing modeling primitives from OWL, but offers no inherent mechanism for describing properties that held in the past, the events that modified their existence, or the temporal relations between different properties. This would be the task of the application modelers that intend to use OWL-Time, and models such as these are difficult to develop and maintain.


An alternative is to incorporate time into the modeling method itself, as a special modeling primitive. Temporal extensions of Description Logics (TDL) use this approach; unfortunately, only restricted fragments of TDL are decidable [3]. This work introduces an ontology-based modeling method for time-dependent applications that attempts to overcome some of the drawbacks mentioned above. The paper focuses on a formal definition of this method, together with some complexity issues that arise. A key element of our approach is related to temporal representation: instead of using FOPL as in Temporal Logic [2] or the Event Calculus [6], we use a hypergraph, which is able to efficiently store the history of a particular domain. Furthermore, information from the past can be used to condition the evolution of the domain. The modeling method has direct applications in areas such as Ubiquitous or Pervasive Computing [7], and can be used to define context-specific, intelligent behavior for software and devices. The paper is structured as follows. In Section 2 we present existing methods for representing and reasoning about time, and describe some of their drawbacks. In Section 3 we informally introduce our modeling method, along with examples of its applicability in intelligent device control. In Section 4 we describe a formalisation of our method and prove that, in some cases, the basic reasoning task, action applicability, can be NP-hard. The proof is done in two steps: it first uses the NP-completeness proof from [8] to reduce the SAT problem to Normalised-Set-Intersection; then Normalised-Set-Intersection is reduced to the action applicability problem. In Section 5 we formulate conclusions concerning the efficiency of the proposed method and discuss future work.
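As a generic illustration of this kind of time-dependent bookkeeping (not the hypergraph formalism defined later in the paper), the sketch below records qualities with validity intervals and answers simple queries about the past; the representation and all names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Quality:
    name: str
    start: int
    end: Optional[int] = None  # None means the quality still holds

@dataclass
class History:
    qualities: list = field(default_factory=list)

    def assert_quality(self, name: str, t: int):
        self.qualities.append(Quality(name, start=t))

    def retract_quality(self, name: str, t: int):
        # Close the currently open occurrence of the quality, if any.
        for q in self.qualities:
            if q.name == name and q.end is None:
                q.end = t

    def held_at(self, name: str, t: int) -> bool:
        # Did the quality hold at time t, possibly in the past?
        return any(q.name == name and q.start <= t and (q.end is None or t < q.end)
                   for q in self.qualities)

h = History()
h.assert_quality("light_on", 3)
h.retract_quality("light_on", 8)
print(h.held_at("light_on", 5), h.held_at("light_on", 9))  # True False
```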

2 Related work

The following modeling methods will be discussed: Temporal Logic [2] and Linear Temporal Logic (LTL) [9], the Temporal Logic of Intervals (TLI) [10,6], Event Calculus (EC) [6], and Description Logics with their temporal extensions [3]. These methods have been selected out of many others for the following reasons: they have had a significant influence on temporal representation research and, to various degrees, they all share similarities with our approach. A common feature of all the above methods is the use of either Propositional Logic or First Order Predicate Logic (FOPL) as a vehicle for knowledge representation. In Temporal Logic, predicates model properties of a given domain and the operators P, F, H and G are used in order to specify temporality. For instance, Pq(x) means that q(x) must have been true sometime in the past. Similarly, Fq(x) states that q(x) will be true sometime in the future, Hq(x) means that q(x) has always been true in the past and Gq(x) states that q(x) will always be true in the future. Similarly, in LTL, both unary and binary temporal operators are used in order to specify temporal properties of propositions [9]. In TLI, Allen uses predicates such as Starts, Overlaps and Meets in order to specify relations between intervals. The entire framework of Description Logics is based on a restricted, though not always decidable, subset of FOPL. EC also uses special predicates, such as before, after, holds and possesses, in order to describe


temporality. For instance, holds(after(e, p)) denotes the validity of the property p after the event e was produced. Our modeling method also uses predicates, i.e. qualities, to denote properties of a domain, but temporal information is stored in a hypergraph. As a result, temporal reasoning is done by means of hypergraph explorations instead of inferences on formulae of Propositional or First Order Logic. As will be discussed in the following sections, this approach has several advantages. With respect to the means of describing time, current approaches can be classified into: (1) approaches that define special temporal primitives. For instance, in Temporal Logic and TLI, modal operators for time are defined; also, in temporal extensions of Description Logics, a temporal dimension is added, so that instances are seen as pairs consisting of individuals and the intervals on which they are enrolled in a particular concept. (2) Approaches that use existing modeling primitives in order to represent time, so that temporal concepts are introduced at the same representational level as other domain-specific concepts. This is the case of OWL-Time and other approaches based on Description Logics; TLI and EC use a similar approach. Modeling methods from the second category have the disadvantage of producing complicated models that are hard to develop and reason about. Methods from the first category are usually more expressive, but the temporal representation layer is often responsible for undecidability. An example is the temporal extension of Description Logics, where only fragments are decidable [3]. Our modeling method also uses special primitives for time but, assuming that the representation of a domain is finite (which is often the case in real applications), temporal reasoning is decidable. This issue will be presented in more detail in the following sections. Allen's TLI, LTL and EC all focus more on describing and reasoning about a temporal discourse, and less on the dynamic nature of the discourse. In contrast, our method provides means for modeling both the history and the evolution of a domain. As mentioned before, temporal reasoning is done by exploring a hypergraph, but this structure is not static: it changes by continuously adding new instances and modifying existing ones.

3 A modeling method based on fluid qualities

3.1 Individuals

Individuals are independent, atomic entities identifiable by themselves. They are perennial: during the evolution of a model, individuals do not disappear or suffer structural changes. They can, however, acquire or lose qualities. Individuals can be encoded as symbols carrying a certain semantic meaning for the modeler, or as literals belonging to well-known data types (numerical, boolean etc.).


3.2 Qualities

A fluid quality (or simply quality) Q(i_1, ..., i_n) represents a time-dependent n-ary relationship between the individuals i_1, ..., i_n. A unary quality Q(i) stands for a time-dependent property associated with the individual i. Qualities hold on specific time intervals. For instance, assuming that d is an individual, the quality Device(d) specifies that d is a device. Similarly, consumes(d, p) is a binary quality stating that d consumes the amount p; here, p is a numeric individual. Qualities are created by instantiating a quality prototype Q(v_1, ..., v_n), where v_1, ..., v_n are variables. We say that a quality was destroyed if its time-slice ended at a certain moment in time. Once introduced in a model, qualities are never completely erased. Qualities are similar to OWL concept instances. A major difference is that qualities are time-dependent and thus need not be unique. For instance, take the pair of qualities On(d) and Off(d). They can appear several times in the evolution of a model, as a device can be turned on or off repeatedly. Using qualities, the number of times a device was on can be identified by examining the qualities On(d) from the past.

3.3 Actions

An action A(i_1, ..., i_n) represents an external stimulus that changes the state of the modeled universe. An action results from the instantiation of an action prototype A(v_1, ..., v_n). Actions enroll one or several individuals, and may require preconditions in the form of the existence of particular qualities. For instance, an action turnOn(d) might require the qualities Device(d) and Off(d). When an action occurs and all its preconditions are satisfied, the action produces effects: the creation and/or destruction of qualities. From this point of view, an action can be considered as a constructor as well as a destructor of qualities. For instance, the action turnOn(d) would produce as effects the destruction of the quality Off(d), which had to exist, and the creation of a new quality On(d). With respect to time, actions are instantaneous: they occur at a particular moment of time and have no duration.
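As a rough illustration of the notions above, the following Python sketch (our own encoding; the class and field names are not taken from the paper's formalism) shows how quality and action prototypes with preconditions and effects could be represented.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Quality:
    """A time-dependent n-ary relationship, e.g. Device('d') or On('d')."""
    name: str
    args: tuple  # individuals (ground terms) or variables such as '?x'

@dataclass
class ActionPrototype:
    """An external-stimulus template, e.g. turnOn(?x)."""
    name: str
    params: tuple                                        # variables, e.g. ('?x',)
    preconditions: list = field(default_factory=list)    # qualities that must hold
    creates: list = field(default_factory=list)          # qualities created as effects
    destroys: list = field(default_factory=list)         # qualities destroyed as effects

# The turnOn example from the text, encoded with this sketch:
turn_on = ActionPrototype(
    name="turnOn",
    params=("?x",),
    preconditions=[Quality("Device", ("?x",)), Quality("Off", ("?x",))],
    creates=[Quality("On", ("?x",))],
    destroys=[Quality("Off", ("?x",))],
)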

3.4 The hypergraph

A set of action and quality prototypes represents a model definition for a particular domain. They specify the types of qualities and actions that can occur, and the conditions in which they occur. From this perspective, the model definition is static. On the other hand, qualities and actions are related to the evolution of a model; they are dynamic. In order to store temporal information about qualities and actions, a hypergraph is used. It consists of (1) action nodes, (2) temporal hypernodes, (3) quality edges and (4) precondition edges. Action nodes stand for actions and temporal hypernodes stand for moments of time. A temporal hypernode contains all action nodes occurring at the same moment of time. Qualities are represented as edges. They span the action nodes responsible for creating and destroying that particular quality. Quality edges have labels consisting of their textual representation of the form


Q(i_1, ..., i_n). Precondition edges connect qualities that act as preconditions for an action to the action itself. Consider the basic specification in the following program. It defines a model for air conditioners which contains a set of quality and action prototypes as well as two predefined qualities: AC(d) and Off(d). Notice the preconditions as well as the effects of the defined actions.

quality AC(?x), On(?x), Off(?x), crtTemp(?x,?y), fanStarted(?x)
individual d
AC(d), Off(d)
action: turnOn(?x)
  preconditions: AC(?x), Off(?x) as ?off
  effects: destroy ?off, On(?x)
action: temp(?x,?y)
  preconditions: AC(?x), On(?x)
  effects: crtTemp(?x, ?y)
action: startFan(?x)
  preconditions: AC(?x), On(?x)
  effects: fanStarted(?x)

Fig. 1 shows a possible hypergraph evolution for the model specification described in the above program.

Fig. 1. The hypergraph

First of all, notice that the hypergraph has two initial temporal nodes, each containing a special action: init and current, respectively. init is the implicit constructor of all predefined qualities, and current is a pseudo-destructor of all qualities that currently hold. In Fig. 1 a. the predefined qualities for the individual d can be seen. When an action a_1 = turnOn(d) is signalled, its preconditions are checked. Since both preconditions are satisfied, a_1 is executed and the hypergraph is modified to accommodate its effects, as seen in Fig. 1 b. The quality Off(d) is destroyed and On(d) is created. AC(d) and Off(d) are the matched preconditions of a_1. The corresponding precondition edges are shown as dotted lines in Fig. 1 b. Fig. 1 c. shows the hypergraph after the execution of two simultaneous actions, an instance of temp(?x, ?y) and startFan(d). The effects of these actions can also be seen.
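The bookkeeping performed on the hypergraph when turnOn(d) is executed can be sketched as follows. This is our own simplified Python illustration, not code from the paper: it ignores temporal nodes and precondition edges and keeps only quality edges with their constructor and destructor.

from dataclasses import dataclass

@dataclass
class QualityEdge:
    label: str                        # textual representation, e.g. "Off(d)"
    created_by: str                   # action node that constructed the quality
    destroyed_by: str = "a_current"   # pseudo-destructor until a real destructor appears

class HyperGraph:
    def __init__(self, initial_qualities):
        # a_init constructs the predefined qualities; a_current is the pseudo-destructor
        self.qualities = [QualityEdge(q, "a_init") for q in initial_qualities]

    def holds(self, label):
        return any(q.label == label and q.destroyed_by == "a_current"
                   for q in self.qualities)

    def apply(self, action_node, preconditions, destroys, creates):
        if not all(self.holds(p) for p in preconditions):
            return False                          # action not applicable
        for q in self.qualities:                  # close the time-slice of destroyed qualities
            if q.label in destroys and q.destroyed_by == "a_current":
                q.destroyed_by = action_node
        for label in creates:                     # open new quality edges
            self.qualities.append(QualityEdge(label, action_node))
        return True

g = HyperGraph(["AC(d)", "Off(d)"])
g.apply("a1=turnOn(d)", preconditions=["AC(d)", "Off(d)"],
        destroys=["Off(d)"], creates=["On(d)"])
print(g.holds("On(d)"), g.holds("Off(d)"))        # True False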


In the examples from Fig. 1, actions were conditioned by the existence of qualities at the current moment. In addition, qualities from the past can be used as preconditions, and particular temporal relations between qualities can be enforced. For instance, a precondition Rains(?x) just_after Hot_day(?x) would be matched by a portion of the hypergraph such as the one presented in Fig. 2.

Fig. 2. Temporal constraints between qualities

The temporal reasoning methods based on the hypergraph are: (1) action applicability, which allows action execution to be constrained, and (2) action firing, which allows actions to be executed when a desired state is achieved. In this paper we will discuss only action applicability and its computational complexity.
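Before the formal treatment in the next section, the applicability check can be pictured as a search for one consistent variable binding. The following Python sketch is our own illustration (the function and data names are not part of the paper's formalism) and assumes that qualities are given as simple name/argument tuples.

def applicable(preconditions, valid):
    """preconditions: list of (name, args) where args may contain '?variables';
    valid: set of ground (name, args) qualities that currently hold.
    Returns a consistent substitution, or None if the action is not applicable."""
    def search(remaining, subst):
        if not remaining:
            return subst
        name, args = remaining[0]
        for vname, vargs in valid:
            if vname != name or len(vargs) != len(args):
                continue
            new = dict(subst)
            ok = True
            for a, v in zip(args, vargs):
                if a.startswith("?"):
                    if new.setdefault(a, v) != v:   # inconsistent binding
                        ok = False
                        break
                elif a != v:                        # ground argument must match
                    ok = False
                    break
            if ok:
                found = search(remaining[1:], new)
                if found is not None:
                    return found
        return None
    return search(list(preconditions), {})

valid = {("AC", ("d",)), ("Off", ("d",))}
print(applicable([("AC", ("?x",)), ("Off", ("?x",))], valid))  # {'?x': 'd'}

In the worst case this backtracking search explores exponentially many candidate bindings, which is consistent with the NP-hardness result proved in Section 4.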

4 A formal definition for fluid qualities

The terms action and quality, used informally before, have a dual meaning. On the one hand, they refer to a textual representation, such as Q(i_1, i_2), which conveys the intended semantics given by the modeler. For instance, State(d, on) means that an individual d is in state on. On the other hand, an action denotes a node and a quality denotes an edge, both in a hypergraph. These constructs hold a temporal meaning, exploited by the modeling method. For instance, two qualities q_1 = (a_1, a_2) and q_2 = (a_2, a_3) have a specific temporal relationship: q_1 ends exactly when q_2 starts, because a_2 is the destructor of q_1 and the constructor of q_2. Both facets described above are strongly related: each quality Q(i_1, ..., i_n) occurs at a certain moment in time, and each edge q = (a_i, a_j) must have an associated textual representation. Nevertheless, there are particular situations when only the textual form of qualities and actions is of interest to us and, vice versa, when only the temporal form is important. For this reason, we introduce two modeling methods based on qualities. The first one is non-temporal: it contains no temporal references and, as a result, no hypergraph. In this setting, qualities either hold (are valid) or not. The second method is temporal: it builds onto the static one by adding a temporal dimension. Its usefulness lies in describing and exploring history. In the following we give definitions for non-temporal and temporal models, and prove that action applicability can be (in the worst case) NP-hard in both of these settings.


4.1 Non-temporal models

Let v ∈ V be variables and i ∈ I be individuals. In some situations, I can contain values such as integers, floats etc. Let Sym_Q be a set of symbols which will be used as names of qualities. Let q(t_1, ..., t_n) be a formula: q ∈ Sym_Q identifies its name, the terms t_1, ..., t_n can be either variables or individuals, and n gives the formula's arity. To avoid cluttering the notation, we will often write q(T), where T = (t_1, ..., t_n). Let s = {(v_1, i_1), ..., (v_k, i_k)} ⊆ V × I be a substitution; it binds variables from V to individuals from I. q(T)s represents the application of the substitution s to the formula q(T): it is the formula in which, for each (v_j, i_j) ∈ s, all occurrences of the variable v_j in T have been replaced with i_j. For example, Q(v_1, v_2){(v_1, i_1), (v_2, i_2)} = Q(i_1, i_2). Two substitutions s_1 and s_2 are called inconsistent if there are (v, i_1) ∈ s_1 and (v, i_2) ∈ s_2 such that i_1 ≠ i_2, in other words, if they bind the same variable to different individuals. If a substitution s leaves unbound variables in q(T)s, then s is partial with respect to q(T); otherwise, s is called complete. We also consider the void substitution s = ∅ as being a valid substitution. Let P_Q = {q_j(t_1, ..., t_{n_j}) | j = 1..m} be a set of formulas whose names are pairwise distinct. Let q(T) ∈ P_Q and consider the following cases:
1. if T contains no individuals, then q(T) is a quality prototype (we use the subscript p, as in q_p, to emphasize this); it contains only variables;
2. q_p s, for any substitution s, is a partial textual instantiation; it can contain both individuals and variables, and it might be the case that it contains no individuals, making it structurally similar to a quality prototype;
3. if T contains no variables, then q(T) is a complete textual instantiation; it contains individuals only.
Examples: let v_1, v_2 ∈ V and i_1, i_2 ∈ I. Then Q(v_1, v_2) is a quality prototype, Q(v_1, i_2) is a partial textual instantiation and Q(i_1, i_2) is a complete textual instantiation.
Property 1. For any textual instantiation q, partial or complete, there exist and are unique a quality prototype q_p and a substitution s such that q_p s = q. We say that q_p is the prototype of q.
Let Sym_A be a set of symbols denoting actions. a(t_1, ..., t_n) is a formula, where a ∈ Sym_A and n is the formula's arity. Then P_A = {a_j(t_1, ..., t_{n_j}) | j = 1..m'} is a set of formulas whose names are pairwise distinct. Let a(T) ∈ P_A. If:
1. T contains no individuals, then a(T) is an action prototype (also denoted by a_p);
2. a_p s, for any substitution s, is a partial textual instantiation;
3. T contains no variables, then a(T) is a complete textual instantiation.


For any textual instantiation a, partial or complete, there exist and are unique an action prototype a_p and a substitution s such that a_p s = a. We say that a_p is the prototype of a. In a non-temporal model, a precondition of an action a is the validity of a complete textual instance of a quality. Let precond be a function that binds each action to a set of preconditions. In a static modeling method, a precondition consists of the validity (or holding) of a quality at a given moment. Let EF = {create, destroy}; elements from EF designate the possible types of effects an action can have on textual quality instances. Let eff be a function that binds each action to a set of effects; the effects refer to the creation or destruction of a quality. Examples: precond(turnOn(d)) = {Device(d), Off(d)}, eff(turnOn(d)) = {(create, On(d)), (destroy, Off(d))}. Let valid be a predicate which specifies whether a complete instance q is valid at the current moment: if valid(q), then q holds.
Non-temporal models. Let M_NT = (P_Q, P_A, precond, eff, S_0) be a non-temporal model for fluid qualities. A state S of a model M_NT is given by the predicate valid, describing the qualities that hold at the current moment. S_0 is the initial state of the model. A transition from a state S to a state S' is performed by the execution of a textual action instance. The execution is done only if the action's preconditions are true.
Action applicability. Let a be a complete action instance, a_p its prototype and s a substitution such that a = a_p s. a is said to be applicable iff there exists a substitution s' such that, for every c ∈ precond(a_p), valid(c s s') holds. In the above definition, c s s' refers to the subsequent application of the substitutions s and s'. The definition can intuitively be read as: an action is applicable if there is a substitution s' which transforms every precondition into a textual instance that holds.
Hardness of the applicability problem. In the following, we give a constructive definition of the applicability problem and prove its NP-hardness. To do this, we reduce SAT to another problem, Normalised-Set-Intersection, which is a particular case of the Set-Intersection problem; the latter is shown to be NP-complete in [8]. Then, action applicability is reduced from Normalised-Set-Intersection, thus concluding the proof. First of all, we reformulate the above definition of action applicability. Let precond(a_p) = {c_1, c_2, ..., c_r} be the preconditions of a_p; the c_i may be partial or complete instances. Let Valid = {q_1, q_2, ..., q_l} be the set of qualities that hold at a given moment, i.e. valid(q) for every q ∈ Valid; the qualities in Valid can only be complete. In order for the action a to be applicable (a = a_p s), each quality c_i ∈ {c_1 s, c_2 s, ..., c_r s} must:
1. either be in Valid, if c_i is complete;
2. or admit a substitution s_i such that c_i s_i ∈ Valid. Moreover, any substitutions s_i, s_j belonging to different preconditions must not be inconsistent, that is, they must not bind the same variable to different individuals. Formally: for all i ≠ j, s_i and s_j are consistent.


For each quality c_i ∈ {c_1 s, ..., c_r s} from the set of preconditions of the action a, we distinguish the following cases:
1. c_i is complete and c_i ∈ Valid. Then the precondition c_i is satisfied for the action a, and this does not affect in any way the satisfiability of the other preconditions of a;
2. c_i is complete and c_i ∉ Valid. Then the precondition c_i is not satisfied, and therefore a is not applicable;
3. c_i is partial, and there is no quality in Valid with the same prototype as c_i; therefore, again, a is not applicable;
4. c_i is partial, c_p is its prototype and c_i = c_p s_i. All qualities q_j = c_p s_j from Valid having the same prototype are such that s_i and s_j are inconsistent, i.e. every quality in Valid with the prototype c_p binds some shared variable differently than s_i does. In this case, a is again not applicable;
5. c_i is partial, and there are several qualities q_j ∈ Valid with corresponding substitutions s_j such that c_i s_j = q_j.
If all preconditions of an action a fall into the first four cases described above, then action applicability can be solved by simply exploring Valid, in O(|Valid|) time. Case 5 is more complicated: any selection of an s_j for the precondition c_i might affect the satisfiability of other preconditions. Example: assume Valid = {Q(i_1, i_2), Q(i_3, i_4), R(i_4)} and the preconditions of an action a being {Q(?x, ?y), R(?y)}. For the precondition Q(?x, ?y) we have two substitutions, s_1 = {(?x, i_1), (?y, i_2)} and s_2 = {(?x, i_3), (?y, i_4)}, that satisfy it. Nevertheless, if we choose s_1, the precondition R(?y) can no longer be satisfied.
In order to handle case 5, let us re-formulate the applicability problem (Action-App). For each precondition c_i, there are several qualities q_j ∈ Valid such that c_i s_j = q_j. Then, for each precondition c_i we build a set C_i containing these substitutions s_j. In this setting, a precondition c_i is represented as the set of substitutions that satisfy it. For instance, in the above example we have s_11 = {(?x, i_1), (?y, i_2)}, s_12 = {(?x, i_3), (?y, i_4)}, s_21 = {(?y, i_4)} and C_1 = {s_11, s_12}, C_2 = {s_21}. Let k be the number of variables present in all preconditions of an action. Then, let B_1, ..., B_k be sets of bindings such that B_j contains the possible bindings of the variable v_j to individuals. In the above example we have k = 2 variables, ?x and ?y, and B_1 = {(?x, i_1), (?x, i_3)}, B_2 = {(?y, i_2), (?y, i_4)}. The following properties must hold for the sets C_i and B_j:
1. for any j ≠ j', B_j ∩ B_j' = ∅: any two sets B_j and B_j' are disjoint;
2. every binding occurring in a substitution from some C_i must appear in some B_j; therefore the union of all bindings from all C_i's is a subset of the union of the B_j's;
3. all substitutions from a set C_i have the same size, since they represent textual instances of a quality prototype with a fixed arity.


The question defining the problem is then: does there exist a set K of substitutions such that, for every i, |K ∩ C_i| ≥ 1 and, for K' = ∪_{s ∈ K} s, for every j, |K' ∩ B_j| ≤ 1? In other words, is there a collection K of substitutions such that at least one substitution is chosen from each C_i (and therefore each precondition is satisfied) and at most one binding is chosen from each B_j (therefore any variable is bound to at most one individual and, as a result, no two chosen substitutions are inconsistent with one another)? We call this problem Action-App. The first constraint could equivalently require |K ∩ C_i| = 1: if |K ∩ C_i| > 1 for some i, then K contains two distinct substitutions of the same prototype and hence at least two bindings of the same variable to different individuals, which contradicts the requirement that at most one binding is chosen from each B_j. Similarly, the second constraint could equivalently require |K' ∩ B_j| = 1: every binding in K' belongs to some B_j and, since every variable appears in some precondition, a K that intersects every C_i leaves no B_j with |K' ∩ B_j| = 0. We will use the formulations based on inequalities, as they facilitate the reductions. Notice that the above formulation is a particular definition of action applicability: more precisely, it addresses only cases 2 and 5 presented above. Case 2 (a complete precondition c_i not present in Valid) can be interpreted as C_i = ∅; no K can then satisfy |K ∩ C_i| ≥ 1, therefore the action is not applicable. Cases 1, 3 and 4 do not give rise to non-trivial sets C_i and are not covered by this formulation. Concluding, in some situations (cases 1-4) action applicability can be computed in O(|Valid|) time; for these scenarios the applicability problem is tractable. Nevertheless, there are cases when action applicability reduces to solving an Action-App instance, which is NP-hard.
In order to prove the hardness of Action-App, we start from another problem, Normalised-Set-Intersection, and show that it is NP-hard.
Hardness of Normalised-Set-Intersection. Consider the following problem. Given sets A_1, ..., A_n and B_1, ..., B_m with the properties that (1) for any B_j, B_j' with j ≠ j', B_j ∩ B_j' = ∅, and (2) A_1 ∪ ... ∪ A_n ⊆ B_1 ∪ ... ∪ B_m, does there exist a K such that, for every i ∈ 1..n, |K ∩ A_i| ≥ 1 and, for every j ∈ 1..m, |K ∩ B_j| ≤ 1? We call this problem Normalised-Set-Intersection, and show by reduction from SAT that it is NP-hard. Let F = Cl_1 ∧ Cl_2 ∧ ... ∧ Cl_n be a CNF formula having n clauses, l literals and k variables. For each variable x_j we build a set B_j = {t_j, f_j}, where t_j and f_j represent the truth values that can be assigned to the variable x_j. This can be done in O(k) time. For each clause Cl_i we build a set A_i according to the following rules:
1. if x_j is a variable that appears in Cl_i without negation, then add t_j to the set A_i;
2. if x_j is a variable that appears negated in Cl_i, then add f_j to the set A_i.
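The clause and variable construction above can be written down directly. The sketch below is our own illustration; it assumes a CNF formula given as a list of clauses, each clause being a list of signed variable indices (DIMACS style).

def sat_to_nsi(clauses, num_vars):
    """Reduce a CNF formula to a Normalised-Set-Intersection instance.
    clauses: e.g. [[1, -2], [2, 3]] means (x1 or not x2) and (x2 or x3)."""
    # B_j = {('t', j), ('f', j)}: the two truth values of variable x_j
    B = [{('t', j), ('f', j)} for j in range(1, num_vars + 1)]
    # A_i collects, for clause Cl_i, the assignments that make one of its literals true
    A = []
    for clause in clauses:
        a_i = set()
        for lit in clause:
            a_i.add(('t', lit) if lit > 0 else ('f', -lit))
        A.append(a_i)
    return A, B

A, B = sat_to_nsi([[1, -2], [2, 3]], 3)
# A set K with |K ∩ A_i| >= 1 for all i and |K ∩ B_j| <= 1 for all j
# corresponds to a satisfying (partial) assignment, e.g. K = {('t', 1), ('t', 2)}.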


All sets A_i can be built in O(n · l_max) time, where l_max is the maximum number of literals a clause can have. According to the above construction, the number of sets A_i is n and the number of sets B_j is k. Notice that all B_j are mutually disjoint, since each B_j contains the assignments of a single variable x_j. Also, it is clear that A_1 ∪ ... ∪ A_n ⊆ B_1 ∪ ... ∪ B_k, since the elements of all A_i are variable assignments. The described transformation is therefore tractable.
(⇒) If F is satisfiable, then there exists a K such that, for every i, |K ∩ A_i| ≥ 1 and, for every j, |K ∩ B_j| ≤ 1. Indeed, if F is satisfiable, then in every clause Cl_i there is a literal that is true under the satisfying assignment. If that literal is a non-negated variable x_j, put t_j in K; if it is a negated variable, put f_j in K. Therefore, for every i, |K ∩ A_i| ≥ 1: K contains at least one variable assignment that makes the clause Cl_i true. Since only one truth value is allowed for a variable, we also have that, for every j, |K ∩ B_j| ≤ 1. If |K ∩ B_j| = 0, then the variable x_j can take either value, true or false, without affecting the satisfiability of the clauses.
(⇐) If there exists a K such that, for every i, |K ∩ A_i| ≥ 1 and, for every j, |K ∩ B_j| ≤ 1, then F is satisfiable. For each B_j, if t_j ∈ K make the variable x_j true in F, and if f_j ∈ K make x_j false. Since each element appears in only one B_j, there are no conflicting assignments and, under this assignment, each clause Cl_i is true. As a result, F is satisfiable. To conclude, Normalised-Set-Intersection is NP-hard.
Hardness of Action-App. Let A_1, ..., A_n, B_1, ..., B_m be an instance of the Normalised-Set-Intersection problem. Based on it, we build an instance of Action-App. Consider that each element from the A_i's and B_j's represents a binding of an action variable v ∈ V to an individual i ∈ I. Let g be a fixed number, arbitrarily chosen, such that g ≤ n. We group the sets A_i, i = 1..n, such that all sets in a group have the same size and no more than g sets are in the same group. For instance, if all sets A_i have the same size, then we build ⌈n/g⌉ groups: the first ⌊n/g⌋ groups have g sets each and the last group contains the remaining sets. Let G be the total number of groups obtained this way and g_r the number of sets in group r; therefore the g_r sum up to n. If a_max is an upper bound for the size of the A_i's, i.e. |A_i| ≤ a_max for every i, then the groups can be built in O(n · a_max) time: we count the elements of each of the n sets A_i. For each group of sets A_{r,1}, ..., A_{r,g_r} we build the cartesian product A_{r,1} × ... × A_{r,g_r}. From each element of the cartesian product we build a set of bindings, i.e. a substitution, and we populate C_r, r = 1..G, with the substitutions built this way. As a result, C_r contains |A_{r,1}| · |A_{r,2}| · ... · |A_{r,g_r}| substitutions. They represent the ways in which a quality c_r could be instantiated; for this reason all the substitutions in C_r must have the same size, which is why the sets in a group are required to have equal sizes, and property 3 from the Action-App formulation holds. The construction of a single C_r is done in O(|A_{r,1}| · ... · |A_{r,g_r}|) time. We consider the number of sets g_r in a group to be constant with respect to n, while the number of groups G varies. For the bound a_max defined previously, the construction of the cartesian product of one group takes O(a_max^g_r) time, where g_r is fixed and a_max varies with the problem's input. If g_max (the maximum group size) is an upper bound for all g_r, i.e. g_r ≤ g_max for every r, the construction of all the C_r's takes O(G · a_max^g_max) time, where G and a_max vary with the input and g_max is considered fixed. The sets B_j of the Action-App instance are taken directly from the B_j's of the Normalised-Set-Intersection instance; their properties, stated in the Normalised-Set-Intersection formulation, directly entail properties 1 and 2 required by Action-App. As a result, the above transformation is tractable.
(⇒) Assume a set K exists for the instance of Normalised-Set-Intersection, such that, for every i, |K ∩ A_i| ≥ 1 and, for every j, |K ∩ B_j| ≤ 1. Since K selects at least one element from every A_i, it also selects at least one element from each A_i of any particular group; therefore, for each group r there exists a tuple t ∈ A_{r,1} × ... × A_{r,g_r} formed only of elements of K, and hence at least one substitution in C_r containing only elements of K. As a result, there exists a set of substitutions K_s such that, for every r, |K_s ∩ C_r| ≥ 1. If we build the union K' of the substitutions in K_s, we obtain a subset of K; since |K ∩ B_j| ≤ 1 for every j, then also |K' ∩ B_j| ≤ 1 for every j.
(⇐) Assume a set K_s exists satisfying the Action-App constraints, in particular |K_s ∩ C_r| ≥ 1 for every r and |K' ∩ B_j| ≤ 1 for every j, where K' is the union of the substitutions in K_s. From each C_r, K_s selects at least one substitution, whose bindings form a tuple of the cartesian product A_{r,1} × ... × A_{r,g_r}; by adding its elements to K, K covers the sets A_{r,1}, ..., A_{r,g_r}. If we add to K all elements of the substitutions selected from all the C_r's, we cover all sets A_1, ..., A_n; therefore, for every i, |K ∩ A_i| ≥ 1. Since K coincides with the union K', we immediately have that, for every j, |K ∩ B_j| ≤ 1. To conclude, Action-App is NP-hard.
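A sketch of the grouping step of the reduction, under the simplifying assumption that the elements of the A_i's are opaque values standing for variable bindings; the function and variable names are ours.

from itertools import product

def nsi_to_action_app(A, g):
    """Group the sets A_1..A_n into groups of at most g sets of equal size,
    then build one C_r per group as the Cartesian product of its members."""
    groups = {}
    for a in A:                       # group by set size
        groups.setdefault(len(a), []).append(a)
    C = []
    for same_size in groups.values():
        for start in range(0, len(same_size), g):     # chunks of at most g sets
            chunk = same_size[start:start + g]
            # each tuple of the product becomes one candidate substitution
            C.append([frozenset(t) for t in product(*chunk)])
    return C

A = [{1, 2}, {3, 4}, {5}]
print(nsi_to_action_app(A, 2))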

4.2 Temporal models

In temporal models, the textual representations of qualities and actions described previously are extended to elements of a hypergraph. Let a ∈ A be actions (or action nodes), and let q = (a_start, a_end) be qualities, represented as edges between actions: a_start is the action that introduces (creates) q and a_end is the action that destroys q. Let E_Q be the set of quality edges. Let t = {a_1, ..., a_k} be a temporal node, enclosing simultaneous actions. Notice that t ⊆ A and t ∈ T, where T is the set of all temporal nodes. Each temporal node t ∈ T is associated with a unique special action a_ts ∈ A_ts, where A_ts is the set of special actions; they are useful for defining an ordering of the temporal nodes, as discussed below. Let e = (q, a) be an edge between a quality edge and an action node; it marks the quality q as a precondition for the action a. E_P is the set of all precondition edges. Then π_Q : E_Q → Q is a projection function that maps a quality edge onto a textual quality instance. For any q ∈ E_Q, if π_Q(q) = c and c_p is the prototype of c, we also say that c_p is the prototype of q.


Let π_A be a function that maps an action node onto a textual action instance. Similarly, for a ∈ A with π_A(a) = b, where b has the prototype b_p, we say that b_p is also the prototype of a. For each special action a_ts mentioned above there is a unique special quality q_ts = (a_ts, a_ts), with π_Q(q_ts) = timestamp(n), n ∈ I: timestamp is a special quality and n is a special individual that represents a numeric value. Let order(t) be the function that returns the timestamp value of a temporal node. Since order(t) is numeric, it can be used as the basis of a total ordering relation over temporal nodes, denoted ≤_T: if order(t_1) ≤ order(t_2), we say that t_1 ≤_T t_2.
The hypergraph. H = (A, T, A_ts, t_init, t_current, E_Q, E_P) is a hypergraph, where A is the set of action nodes, T is the set of temporal nodes, E_Q contains the quality edges and E_P the precondition edges. t_init and t_current are special temporal nodes that denote the initial and the current moment of time. t_init contains a special action a_init that is the constructor of all initial qualities, and t_current contains a_current, a temporary pseudo-destructor of all qualities that hold at the current moment.
Extending preconditions. In the previous section, action preconditions merely required the presence of qualities. We extend their definition in order to incorporate temporal constraints. Let TC = {before, after, just_before, just_after} be the set of temporal constraints. A precondition p is recursively defined as p ::= c | p tc c, where c is a textual quality instance and tc ∈ TC. Examples of preconditions are: Q(x), Q(x) before R(x,y), Q(x) before R(x,y) after P(y). Let precond_T be an extension of the function precond defined in the previous section, allowing actions to enforce temporal constraints between qualities.
Temporal models. Let M_T = (P_Q, P_A, precond_T, eff, H_0) be a temporal model. The states are no longer represented by the predicate valid, but by a hypergraph H; all other constructs are the same, except for the precondition function and the structure of the initial state. From any temporal model M_T a non-temporal model M_NT can be obtained by: (1) building precond from precond_T, by letting c ∈ precond(a_p) whenever c appears in some precondition p = c_1 tc_1 c_2 tc_2 ... tc_{k-1} c_k of precond_T(a_p); and (2) transforming the hypergraph representation into one using the valid predicate: for every quality edge q ∈ E_Q whose destructor is a_current, let valid(π_Q(q)) hold. The above transformation flattens both the preconditions and the hypergraph, thus obtaining a non-temporal view of the modeled domain.
Action applicability in temporal models. Action applicability in temporal models generalises action applicability in non-temporal models. Intuitively, the process of finding a general substitution that satisfies all preconditions of an action is now


complicated by an additional constraint: the qualities must also be in specific temporal relationships. For each temporal constraint from TC, we define a temporal relation over qualities. For any two qualities q_1 = (a_1, a_2) and q_2 = (a_3, a_4), with a_2 ∈ t_2 and a_3 ∈ t_3:
1. if t_2 ≤_T t_3, then R_before(q_1, q_2);
2. if t_2 = t_3, i.e. the destructor of q_1 and the constructor of q_2 occur at the same moment, then R_just_before(q_1, q_2).
Using these relations, their converses can be derived:
3. if R_before(q_1, q_2) then R_after(q_2, q_1);
4. if R_just_before(q_1, q_2) then R_just_after(q_2, q_1).
Relations such as R_before and R_after are used to define action applicability. A precondition p = c_1 tc_1 c_2 tc_2 ... tc_{k-1} c_k (tc_i ∈ TC) of an action is considered temporally (but not fully) satisfied by the qualities q_1, ..., q_k if π_Q(q_1) = c_1, ..., π_Q(q_k) = c_k and, for every i, the relation R_{tc_i}(q_i, q_{i+1}) holds. An action a from a temporal model M_T is applicable in a hypergraph H if there exists a set of qualities Q_a ⊆ E_Q such that:
1. each precondition p is temporally satisfied by a subset of the qualities in Q_a;
2. π_A(a) is an applicable textual action in the non-temporal model M_NT associated with M_T, built using the rules above.
The above definition builds on action applicability in non-temporal models and adds temporal constraints between qualities. It is straightforward to see that this generalised action applicability includes Action-App as a special case and is, therefore, also NP-hard.
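Assuming timestamps are available for the temporal nodes, the relations R_before and R_just_before can be checked directly. The following Python sketch is our own reading of the definitions above (in particular of the boundary cases), not code from the paper.

def r_before(q1, q2, order):
    """q1, q2: (constructor_node, destructor_node); order: node -> timestamp.
    q1 is 'before' q2 if q1's destructor occurs no later than q2's constructor."""
    return order[q1[1]] <= order[q2[0]]

def r_just_before(q1, q2, order):
    """q1 is 'just before' q2 if q1 is destroyed at the very moment q2 is created."""
    return order[q1[1]] == order[q2[0]]

order = {"t0": 0, "t1": 1, "t2": 2}
hot_day = ("t0", "t1")   # quality created at t0, destroyed at t1
rains = ("t1", "t2")     # quality created at t1, destroyed at t2
print(r_just_before(hot_day, rains, order))   # True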

5 Conclusions

The NP-hardness result for action applicability may seem discouraging. However, as mentioned above, NP-hardness appears only in particular scenarios. The complexity of a model does not necessarily entail intractability: while there are models for which the applicability problem is exponential, this is not necessarily true for all large models. The modeling method is the basis of a declarative language coupled with an interpreter based on CLIPS [11] and the JESS engine [12]; the language and the interpreter are currently being developed. In connection with our method, verification and validation techniques are being researched, aimed at detecting inconsistencies in model specifications. It is important to emphasize that our method has potential advantages for numerous applications, such as specifying intelligent, time-dependent behavior for devices. Such areas remain to be further investigated.


References
1. OWL 2 Web Ontology Language, October 2009. http://www.w3.org/TR/owl2-overview/
2. Galton, A.P.: Temporal Logic, 1999.
3. Artale, A., Franconi, E.: A survey of temporal extensions of description logics. Annals of Mathematics and Artificial Intelligence, 30:171-210, March 2001.
4. Augusto, J.C.: The logical approach to temporal reasoning. Artificial Intelligence Review, 16:301-333, December 2001.
5. Time Ontology in OWL, September 2006. http://www.w3.org/TR/owl-time/
6. Fisher, M., Gabbay, D., Vila, L.: Handbook of Temporal Reasoning in Artificial Intelligence. Elsevier Science Inc., New York, NY, USA, 2005.
7. Schmitt, J., Hollick, M., Roos, C., Steinmetz, R.: Adapting the user context in realtime: Tailoring online machine learning algorithms to ambient computing. Mobile Networks and Applications, 13:583-598, 2008. doi:10.1007/s11036-008-0095-8
8. Set intersection problem, March 2011. http://research.cs.queensu.ca/ cisc365/2010F/365
9. Pliuškevičius, R.: Similarity saturation for first order linear temporal logic with unless. In: Alferes, J., Pereira, L., Orlowska, E. (eds.), Logics in Artificial Intelligence, volume 1126 of Lecture Notes in Computer Science, pages 320-336. Springer, Berlin/Heidelberg, 1996.
10. Allen, J.F., Ferguson, G.: Actions and Events in Interval Temporal Logic. Journal of Logic and Computation, 4:531-579, 1994.
11. Giarratano, J.: CLIPS Reference Manual, 1994.
12. Friedman-Hill, E.: Jess in Action: Java Rule-Based Systems. Manning Publications Co., Greenwich, CT, USA, 2003.


A Multidimensional Perspective on Social Networks

Cristian Turcitu1, Florin Radulescu1

1 Computer Science Department, Politehnica University, Bucharest, Romania
[email protected]

Abstract. Basically, a social network is a grouping of people with common interests, hobbies or social status. Lately, social networking has evolved from the real grouping of individuals (colleagues at work, students at school, etc.) to the virtual grouping of users, with the aid of social networking websites. For some individuals, using social networks has become a way of life, while for others it is just a means of entertainment. In this article we use a set of multidimensional scaling algorithms to analyze the perception that people have of the social networks used nowadays.
Keywords: MDS, social network, proximity, similarity, PROXSCAL, INDSCAL

1 Introduction

Multidimensional scaling (MDS) denotes a family of techniques used for geometrically representing the relationships between objects of high dimensionality. It was first used in psychology [1, 2], to understand how people perceive the similarity between different sets of objects. MDS is now used in several areas, such as marketing, physics, social science and biology. Based on a matrix (called the proximity matrix) obtained from the dissimilarities or similarities between the input objects, the aim of an MDS procedure is to find a configuration of points which maps the initial proximities to distances in a lower-dimensional space. Given N, the number of input objects, and D = (d_ij) ∈ R^{N×N}, the matrix of similarities or dissimilarities between objects, solving an MDS problem means finding a configuration of points in an m < N dimensional space for which the difference below is minimal: || d_ij - d(X_i, X_j) ||

(1)

where d(X_i, X_j) represents the distance between the points X_i = (x_i1, x_i2, ..., x_im) and X_j = (x_j1, x_j2, ..., x_jm) in the new representation space. The distance can be defined as:

d(X_i, X_j) = ( Σ_{k=1..m} | x_ik - x_jk |^p )^{1/p}

(2)

where p ≥ 1 can take several values. For p = 1 the formula above is known as the Manhattan distance, for p = 2 we have the Euclidean distance, and for p > 2 we talk about the p-norm Minkowski distance [7]. In this paper we will refer to the Euclidean distance when using the term "distance" to compare different objects. We have chosen this distance because the input data uses the same scale for all the objects studied and, in Euclidean measurements, adding new objects does not affect the distance between two initial points.
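As a quick illustration of formula (2), the following Python sketch computes the Minkowski distance for an arbitrary p; the function name and the sample points are ours.

def minkowski(x, y, p=2):
    """Distance between two points of the m-dimensional configuration.
    p=1: Manhattan, p=2: Euclidean, p>2: general p-norm Minkowski distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski((0, 0), (3, 4)))        # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), p=1))   # 7.0 (Manhattan)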

2 Different flavours of MDS

Based on the input proximity matrix, the MDS algorithms can be classified as metric and non-metric. In metric MDS the input data is quantitative (interval or ratio data), while non-metric MDS uses ordinal data for the proximity matrix. In metric MDS, the general stress formula is:

σ²(X) = Σ_{i<j} w_ij ( d_ij - d(X_i, X_j) )²

(3)

where w_ij is a weight function that can be used in some types of MDS. Usually the weights in the above expression are all equal to 1, but there are cases when they can take other values, depending on the situation.
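The raw stress of formula (3) can be computed directly from a proximity matrix and a candidate configuration. The following sketch is our own illustration, with made-up proximities and an arbitrary two-dimensional configuration.

from math import dist

def raw_stress(D, X, W=None):
    """sigma^2(X) = sum_{i<j} w_ij (d_ij - d(X_i, X_j))^2, where d_ij are the
    input proximities (matrix D) and d(.,.) is the Euclidean distance in X."""
    n = len(D)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if W is None else W[i][j]
            total += w * (D[i][j] - dist(X[i], X[j])) ** 2
    return total

D = [[0, 2, 5], [2, 0, 4], [5, 4, 0]]       # illustrative proximities
X = [(0.0, 0.0), (2.0, 0.0), (5.0, 0.5)]    # a candidate 2-D configuration
print(round(raw_stress(D, X), 3))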

2.1 Non-metric multidimensional scaling

When the rank order of the dissimilarities is more important for the analysis than their numerical values, non-metric MDS can be used. In non-metric scaling the proximities d_ij are replaced by disparities δ_ij, which are sometimes called pseudo-distances because they are not real distances; they are obtained from the dissimilarities using a function f:

f(d_ij) → δ_ij

(4)

which is monotonically increasing, meaning that for a pair of values a, b, if a < b then f(a) ≤ f(b). For non-metric MDS the function from (3) becomes:

σ²(X) = Σ_{i<j} w_ij ( δ_ij - d(X_i, X_j) )²

(5)

where Δ represents the matrix of disparities δ_ij. Therefore, finding a non-metric MDS solution comprises two steps:
• finding a monotonically increasing function which transforms the dissimilarities into disparities;
• minimizing the stress function from (5).
For the first step, Kruskal [4] proposed the use of the pooled-adjacent-violators algorithm (PAVA). For further details and improvements on this algorithm, Zhou's article can be consulted [5].
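The first step can be illustrated with a compact, unweighted version of the pooled-adjacent-violators idea (a simplification written by us for illustration, not Kruskal's full procedure): distances taken in the rank order of the dissimilarities are averaged whenever monotonicity is violated, yielding the disparities.

def pava(values):
    """Unweighted pooled-adjacent-violators: returns the non-decreasing sequence
    of block means that is closest (in least squares) to `values`."""
    blocks = []                        # each block: [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge backwards while the last block mean drops below the previous one
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    result = []
    for s, c in blocks:
        result.extend([s / c] * c)
    return result

# distances listed in increasing order of the corresponding dissimilarities;
# the violator 2.0 is pooled with its neighbour
print(pava([1.0, 3.0, 2.0, 4.0]))   # [1.0, 2.5, 2.5, 4.0]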


Minimizing the stress function is a complex problem which requires appropriate algorithms. According to Borg & Groenen [6], one common approach is SMACOF (Scaling by Majorizing a Complicated Function), an algorithm that uses iterative majorization to reach a satisfactory solution. The main idea of iterative majorization is to replace a complicated function with another one for which the minimization process is simpler. A detailed description of this stress-majorization approach is given in [8].
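For readers who wish to reproduce such an analysis outside SPSS, recent versions of scikit-learn expose a SMACOF-based MDS implementation. The sketch below uses a small made-up dissimilarity matrix, not the survey data analysed in this paper.

import numpy as np
from sklearn.manifold import MDS

# illustrative symmetric dissimilarity matrix with zero diagonal
D = np.array([[0.0, 1.0, 3.0, 4.0],
              [1.0, 0.0, 2.0, 3.0],
              [3.0, 2.0, 0.0, 1.0],
              [4.0, 3.0, 1.0, 0.0]])

# metric=False selects non-metric (ordinal) scaling; SMACOF runs under the hood
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          n_init=8, max_iter=300, random_state=0)
X = mds.fit_transform(D)
print(X.shape, round(mds.stress_, 4))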

2.2 Replicated scaling

The difference between classical algorithms and replicated MDS (RMDS) is that the latter uses several proximity (or, for non-metric scaling, disparity) matrices, one matrix for each subject participating in the experiment. RMDS assumes that the final dimensions are the same for all input matrices and that every subject received the same stimuli during the experiment. In RMDS, the stress formula becomes:

σ²(X) = Σ_{k=1..s} Σ_{i<j} w_ijk ( d_ijk - d(X_i, X_j) )²

(6)

where s represents the number of input sources (proximity matrices), w_ijk is the weight between the objects i and j in the k-th input matrix, and d_ijk is the dissimilarity between the objects i and j in matrix k.

2.3 Weighted MDS

Weighted MDS (WMDS) is the next major development; it generalizes the distance model. It differs from RMDS in that the input dissimilarity/disparity matrices of the individual subjects receive different weights. Because the WMDS model incorporates individual differences for each participating subject, it is sometimes called "individual differences scaling" (INDSCAL). In the function from relation (6), the weights represent the individual differences between subjects and, in WMDS, they take positive values lower than 1.

3 Social networks and MDS

According to [10], a social network is any website designed to allow multiple users to publish content themselves. The information may be on any subject and may be for consumption by (potential) friends, mates, employers, employees, etc. The sites typically allow users to create a “profile” describing themselves and to exchange public or private messages and list other users or groups they are connected to in some way. There may be editorial content or the site may be entirely user-driven. From the definition above it is obvious that on the Internet there are a multitude of websites utilised for social networking. We want to see how these networks are perceived and thus we will make an analysis based on the input data of 30 subjects.


3.1 Choosing the input objects

For the experiment, the following social networks were selected: Twitter (T), Netlog (N), Linkedin (L), Hi5 (H), Friendster (Fr), Flickr (Fl) and Facebook (Fa). The abbreviations in parentheses will be used for describing the corresponding values for each social network. In order to collect the data, we created a survey containing each pair of objects (social networks) and asked the participants to rate their perception of the differences within each pair. A scale from 1 to 5 was used in the experiment: a mark closer to 5 meant that the compared social networks were quite different according to the participant, while a mark closer to 1 meant that the compared objects were perceived as being similar. At the end of the questionnaire, the subjects were asked to give a mark from one to ten for each social network, depending on how familiar they were with it: a lower mark meant that the subject did not use the social network much, while a mark closer to ten meant they were very familiar with it. There were no missing values or marks outside the range, so we can consider the input data to be pre-processed. Thirty persons participated in the experiment, each having to grade 7*(7-1)/2 = 21 pairs of social networks with marks from one to five and to give grades from one to ten to the social networks involved in the experiment.

3.2 Selecting IBM SPSS Statistics 19 for analysis

SPSS is one of the software programs widely used for statistical analysis in the social sciences [9]. Market researchers, government agencies, survey companies and education researchers are some of the categories of users of this product. For processing the data we will use the PROXSCAL module of the latest version of the IBM software, SPSS 19. We selected the PROXSCAL module because it implements the SMACOF algorithm discussed previously. PROXSCAL also has several options that allowed us to specify the number of iterations, the initial configuration of the solution matrix, or the stress value at which the algorithm stops. The module accepts several types of input: full symmetric matrix, lower or upper triangular matrix, rectangular matrix. For our example, a lower triangular matrix will be created from the data provided by every participant, in order to obtain the appropriate input for the PROXSCAL procedure. Two types of analyses will be performed on the input data. The first is a replicated analysis, in which the algorithm is run three times: first with 10 subjects, then with 20 and finally with 30 subjects. This experiment shows how important the number of participants is and what the differences between the final results are. In the second experiment, the level of knowledge of each social network, as specified by the subjects, is used as weights. For a better analysis, we will compare the replicated MDS algorithm with the weighted one and draw our conclusions.
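The lower triangular input described above can also be assembled programmatically. The following sketch uses hypothetical ratings (not the actual survey answers) and the network order T, N, L, H, Fr, Fl, Fa.

from itertools import combinations

networks = ["T", "N", "L", "H", "Fr", "Fl", "Fa"]
pairs = list(combinations(networks, 2))          # 7*(7-1)/2 = 21 pairs
assert len(pairs) == 21

# hypothetical ratings on the 1-5 dissimilarity scale, one per pair, in `pairs` order
ratings = [3, 4, 2, 5, 1, 3, 4, 2, 3, 5, 1, 2, 4, 3, 5, 2, 1, 3, 4, 5, 2]

matrix = [[0] * len(networks) for _ in networks]
for (a, b), r in zip(pairs, ratings):
    i, j = networks.index(a), networks.index(b)
    matrix[max(i, j)][min(i, j)] = r             # fill only the lower triangle

for row in matrix:
    print(row)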


3.3 Data processing

We will start with the data provided by the first ten participants in the experiment. In the figures below, we can observe that increasing the dimensionality has a different effect on replicated MDS than on weighted MDS. While in the replicated analysis the stress function decreases as the dimension increases, in the weighted model there is a slight increase when going from two to three dimensions, and a decrease from three to four dimensions. After four dimensions, the weighted MDS does not show the desired behaviour. Thus, for both the replicated and the weighted models we will analyze a three-dimensional and a two-dimensional solution.

Fig.1. Scree plot for the 10 participants: replicated (left) versus weighted (right) MDS

The solution obtained using three dimensions can be observed in Fig. 2. It is hard for the human eye to analyze a solution whose dimensionality is greater than three. The replicated solution spreads the social networks all over the chart, so no clear conclusion can be drawn regarding the general perception of these social networks. On the other hand, in the weighted MDS solution one can see that four objects are grouped together, while three of them are separated.

Fig.2. A three-dimension solution: replicated (left) and weighted (right) MDS


The bi-dimensional solution offers a better view of the objects involved in processing.

Fig.3. Two dimensional solution for replicated and weighted MDS

Replicated MDS gives a uniform distribution of the input objects; we can see, though, that Twitter, Facebook and Netlog form a small group, which may mean that the participants consider these objects related. On the other hand, the weighted solution is more intuitive. If we take into account how familiar the subjects are with the social networks in the experiment, we can see a cluster of five networks: Netlog, Hi5, Linkedin, Flickr and Friendster. Another observation is that the Twitter and Facebook social networks are separated from the group; users do not take one for the other. Analyzing the input data for the first ten participants, Facebook is very familiar while Twitter is quite unfamiliar; thus, the weights have a great impact on the way these social networks are perceived. When we increase the number of participants to 20, the results obtained are slightly different. For replicated MDS, a two-dimensional solution brings the stress function value below 0.04, while a three-dimensional solution further decreases the stress to under 0.02. The weighted model, though, shows an unstable behaviour, the stress function increasing after four dimensions.


Fig.4. Scree plot representation for 20 subjects

When considering a three dimensional solution, the differences are not very clear, as we can observe below.

Fig.5. The object points in a 3D space

The final result in this situation is similar to the result obtained when using only 10 participants, with one exception: in weighted MDS, Facebook has joined the cluster of five networks, but Linkedin is no longer in that group.


Fig.6. Two dimensional solution for 20 participants: replicated and weighted MDS

For 30 participants, the behaviour is similar to that of the previous solution.

Fig.7. Replicated and weighted MDS scree plot for 30 participants

Choosing a two-dimensional solution, the output can be observed in the figures below.

Fig.8. Social network disposal in a bi-dimensional space


The similarity with the previous result is obvious, which leads to the conclusion that a further increase in the number of participants would not bring radical changes to the solution. Twitter is seen as a social network with special functionality, and so is Linkedin, while the rest of the objects are perceived as being similar.

4 Conclusions and future work

Both replicated and weighted scaling were used to find out how social networks are perceived, but only the weighted model offered a satisfying solution. The replicated MDS solution could not be interpreted very clearly, and the answer can be found in the weighted one. When we used in our analysis the level of knowledge each person has of the social networks, the results were more accurate. This means that, when participants were not familiar with some social networks, the answers they provided regarding the differences between them were not very specific, so the weights were useful in sorting that out. Another conclusion is that the solution depends on the number of subjects, but after this number reaches a certain value the changes in the final results become insignificant. The first 10 participants considered Facebook as being separate from the cluster, while the solutions with 20 and 30 participants changed that: Facebook became part of the same group as Hi5, Netlog, Flickr and Friendster. Consequently, if there are few participants, the results may be unconvincing. MDS is an exploratory set of techniques that can be used in various fields. In this paper we have used non-metric scaling to analyze the perception of youngsters of social networks. The interpretation of the output is challenging and highly subjective; other persons could have arrived at slightly different readings of the results. In this experiment, the persons who completed the survey were young people between 20 and 26 years old, all with university education. A future challenge might be to analyze a larger group of participants, from different regions of the country, with different backgrounds or social categories, and of different ages.

References
1. Guttman, L. (1954). A new approach to factor analysis: the radex. In P. Lazarsfeld (Ed.), Mathematical thinking in the behavioral sciences (pp. 258-348). New York: Free Press.
2. Young, G. & Householder, A. S. (1941). A note on multidimensional psycho-physical analysis. Psychometrika.
3. Shepard, R.N. (1962). The Analysis of Proximities: Multidimensional Scaling with an Unknown Distance Function. Psychometrika, 27, 125-140.
4. Kruskal, J.B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115-129.
5. Zhou, W. (2004). A review and implementation of some approaches to metric clustering.
6. Borg, I. & Groenen, P. (2005). Modern Multidimensional Scaling: Theory and Applications, Second Edition. New York: Springer.
7. Deza, E. & Deza, M.M. (2009). Encyclopedia of Distances, page 94. Springer.
8. Gansner, E., Koren, Y. & North, S. (2004). Graph Drawing by Stress Majorization. Proceedings of the 12th Int. Symp. on Graph Drawing (GD'04), Lecture Notes in Computer Science, Springer-Verlag, pp. 239-250.
9. Argyrous, G. (2005). Statistics for Research: With a Guide to SPSS. Sage: London.
10. What is social networking? http://www.whatissocialnetworking.com/


Quantitative Models of Communities of Practice: A Call for Research

Nicolae Nistor1

1 Ludwig-Maximilians-Universität, Faculty of Psychology and Educational Sciences, Leopoldstr. 13, 80802 München, Germany
[email protected]

Abstract. Participation in communities of practice (CoPs) is an important way of constructing and sharing knowledge in various domains. However, the relationships between the fundamental CoP notions are not yet sufficiently researched. While there are numerous qualitative CoP studies in educational research, quantitative studies are still lacking. The paper at hand proposes a quantitative causal model of CoPs deduced from the current research literature. Further, it outlines three quantitative studies that confirm this model. Their validity is, however, limited by the relatively small samples and by a methodology that may lead to common-method bias. In conclusion, the author calls for interdisciplinary CoP research based on the automated analysis of the social structures and of the collaborative discourse that CoPs build in their practices.
Keywords: communities of practice, participation, expert status, cultural artifacts

1 Rationale

Communities of practice (CoPs) are large groups of people sharing goals, activities and experience in the frame of a given practice. Participation in CoPs leads to the accumulation of experience; it stimulates the social construction of knowledge and the development of expertise [1, 2]. Examining the CoP literature, we observe that it rests mainly on qualitative research; there are hardly any quantitative studies about knowledge construction, learning and individual development in this context, particularly about the relationships between the fundamental notions of CoPs. Three studies conducted by the author [3, 4, 5] propose a causal model that includes individual knowledge, the intensity of participation, the expert status, and the community member's contribution to artifact development. Due to the relatively small samples and to missing replications, these studies have limited validity. Also, newer technologies that could provide a deeper analysis of the interaction contents and of their social frame were not yet involved. Therefore, the author calls for quantitative research on CoPs that fills this gap and extends the validity of the previous research. Since replications of the available studies, enhanced by more advanced technology such as automated social network analysis and natural language


processing, are recommendable as a starting point for further studies, the paper at hand gives a short overview of three recent quantitative studies of CoPs. In conclusion, the strengths as well as the limitations of these studies are outlined in order to suggest new directions for further research.

2 Theoretical background

Communities of practice (CoPs) are groups of people sharing goals, activities and experience in the frame of a given practice [1, 2]. Etienne Wenger [2] explains the term community as "a way of talking about the social configurations in which our enterprises are defined as worth pursuing and our participation is recognizable as competence" and practice as "a way of talking about the shared historical and social resources, frameworks, and perspectives that can sustain mutual engagement in action". Most characteristic of CoPs is that practice continues over lengthy periods of time. In most cases, the termination of activity is neither planned nor foreseen. The special interest of the Educational Sciences in CoPs resides in the beneficial social and cognitive effects of participation in CoPs, which can hardly be achieved in formal settings such as the traditional school. Participation in CoPs is assumed to lead to the accumulation of experience, to stimulate the social construction of knowledge and the development of expertise [1, 2, 6, 7, 8, 9, 10]. As Lave and Wenger emphasize, "a community of practice is an intrinsic condition for the existence of knowledge, not least because it provides the interpretive support necessary for making sense of its heritage" [1]. However, "researchers insist that there is very little observable teaching; the more basic phenomenon is learning. The practice of the community creates the potential 'curriculum' in the broadest sense". This view "opens an alternative approach to the traditional dichotomy between learning experientially and learning at a distance, between learning by doing and learning by abstraction" [1]. Examining the CoP literature, we observe that it rests mainly on qualitative research; there are hardly any quantitative studies about learning and development in this context, nor empirical evidence on the development of cultural artifacts in communities. Quantitative findings on the relationships between the fundamental notions of CoPs are still scarce. To introduce recent research, the central notions related to CoPs are defined in the following.

Domain knowledge

CoP research relays on a social-constructivist understanding of knowledge that deals both with individual and with social aspects. The individual aspects correspond to the general view of domain knowledge. Etienne Wenger [2] builds the theory of situated learning starting from the premise that “knowledge is a matter of competence with respect to valued enterprises – such as singing in tune, discovering scientific facts, fixing machines, writing poetry, being convivial, growing up as a boy or a girl, and so forth”. So far, Wenger’s view of knowledge corresponds to the generally accepted definition of expertise, as advanced and reproducible knowledge and skills in a specific domain. As Ericsson [11] formulates, expertise “refers to the characteristics,


skills, and knowledge that distinguish experts from novices and less experienced people. In some domains there are objective criteria for finding experts, who are consistently able to exhibit superior performance for representative tasks in a domain.” In CoPs, a member’s knowledge can vary from little or no knowledge in the case of novices, to expert knowledge. As people learn, knowledge may grow in time. As a rule of thumb confirmed by empirical findings, full expertise requires approximately ten years of intensive, intentional, reflective study and training (“deliberate practice”) in the given domain [12]. Ideally, the longer the time-on-practice becomes, the more experience may be acquired and the more individual expertise may develop. Attempting to formulate a quantitative model of CoPs, we regard domain knowledge in the sense mentioned above [11], together with the time spent in the CoP [12] as independent variables that predict participation. Nevertheless, knowledge or expertise may be acquired in the community practice as Lave and Wenger claim [1], but it can be brought from outside as well [9], which may imply more complex relationships between these three variables. 2.2

Participation

Wenger [2] defines participation as a notion that “refers not just to local events of engagement in certain activities with certain people, but to a more encompassing process of being active participants in the practices of social communities and constructing identities in relation to these communities. Participating in a playground clique or in a work team, for instance, is both a kind of action and a form of belonging. Such participation shapes not only what we do, but also who we are and how we interpret what we do.” This definition is complemented by the numerous examples of CoPs described in the research literature, which reveal also differences in the intensity of participation, depending on the members’ domain knowledge. Members with higher expertise are involved in more activities, including those with a higher degree of difficulty and responsibility. “A newcomer’s tasks are short and simple, the costs of errors are small, the apprentice has little responsibility for the activity as a whole. A newcomer’s task tends to be positioned at the ends of branches of work processes, rather than in the middle of linked work segments” [1]. Thus, expert status may be regarded as a consequence of central, i.e. intensive participation in the most important and challenging activities of the CoP. In other words, participation may be regarded as a mediator of the relationship between domain knowledge and expert status. 2.3

Expert status

Wenger [2] regards identity as “a way of talking about how learning changes who we are and creates personal histories of becoming in the context of our communities.” His definition of identity having full CoP membership at its highest point – and goal of the learning trajectory – implies the expert status in the CoP context. A full member of a CoP is an expert who not only possesses superior knowledge and skills, i.e. intrinsic expertise, but who is also socially recognized as an expert. Thus, expert


identity comprises besides the domain knowledge of the CoP member, above all the expert status, which is the result of negotiation with and recognition by other CoP members, which takes place in the context of participation. “We conceive of identities as long-term, living relations between persons and their place and participation in communities of practice. Thus identity, knowing, and social membership entail one another” [1]. An expanding line of research analyzes the individual position in social networks by means of graph theory, and provides thus mathematical definitions of the expert status [13]. While Lave and Wenger in their first approach to CoPs describe mainly evolutions from the novice to the expert identity, Wenger [2] differentiates several learning trajectories, such as inbound, peripheral, insider etc. Not every CoP member follows a learning trajectory from novice to expert, in this respect more recent literature [7] emphasizes the complexity of the learning trajectories. In conclusion, a quantitative causal CoP model may present expert status as the result of negotiation with and recognition of other CoP members, thus directly influenced by the intensity of participation in the community practice. 2.4

Cultural artifact development

Generally, the term artifact designates a material as well as an immaterial product of human activity. In the cultural context of a community practice, a cultural artifact has a specific meaning, related to the activities being performed, and to the CoP members performing them. As an example of cultural artifact used in academic CoPs, web pages usually present the members of researcher teams, along with their research, publications and teaching. Thus, they correspond to Wenger’s [2] assertion, that “artifacts are boundary objects, and designing them is designing for participation rather than just use. Connecting the communities involved, understanding practices, and managing boundaries become fundamental design tasks”. The presentations of academic practice embodied in websites may be regarded as parts of the academic discourse. According to Gillespie and Zittoun’s [14] conceptualization of the use of symbolic resources, academic web pages play the role of tools that mediate researchers’ acting on and communication with the academic world. The “leading voices” are the senior scientists; the younger ones “learn to talk” within the academic discourse, as asserted by Lave and Wenger [1]. Wenger [2] claims that the duality of participation and reification of knowledge, i.e. artifact production, is the key to learning processes in the context of CoPs. Through participation, knowledge is both constructed and reified. Conversely, reified knowledge enables further participation. Nistor [15] proposes that both collaborative knowledge construction and reification are not equally accessible to all community members. While regular members and experts participate, reify their experience, and continuously construct knowledge, beginners must first absolve a cognitive apprenticeship before they gain full access to all the community’s activities and resources. Consequently, members’ contribution to the production of cultural artifacts may be modeled as being influenced by the expert status.

2.5 Research model

As a summary of these theoretical considerations, the research model proposed in this work will include domain knowledge and time in the community as independent variables, the intensity of participation will mediate their influence on the expert status, which will further influence members’ contribution to artifact development. The research model is depicted in fig. 1.

Fig.1. Causal quantitative model of expertise, intensity of participation, identity and contribution to cultural artifact development in CoPs
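The three studies summarized below test this model with regression-based mediation analyses. Purely as an illustrative sketch (not the authors' analysis code), the chain in Fig. 1 could be examined with ordinary least-squares regressions; the column names knowledge, time, participation, status and artifact, and the file name cop_survey.csv, are hypothetical.

# Illustrative sketch of testing the mediation chain in Fig. 1 with OLS regressions
# on a hypothetical survey data set. Variable and file names are invented.
import pandas as pd
import statsmodels.formula.api as smf

def fit_model(df: pd.DataFrame):
    # Step 1: independent variables -> mediator (intensity of participation)
    m1 = smf.ols("participation ~ knowledge + time", data=df).fit()
    # Step 2: mediator -> expert status, controlling for the independent variables
    m2 = smf.ols("status ~ participation + knowledge + time", data=df).fit()
    # Step 3: expert status -> contribution to artifact development
    m3 = smf.ols("artifact ~ status + participation", data=df).fit()
    return m1, m2, m3

if __name__ == "__main__":
    df = pd.read_csv("cop_survey.csv")      # hypothetical survey export
    for m in fit_model(df):
        print(m.summary())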

3 Three quantitative studies of CoPs

3.1 Study 1: A German academic CoP

The first quantitative study aimed at verifying the proposed research model (fig. 1) was conducted in an academic CoP at a German university, in the field of Psychology and Educational Sciences. The sample consisted of N = 70 persons of different expertise levels, belonging to two different researcher teams and to a management unit. The variables domain knowledge, intensity of participation and contribution to artifact development were measured by means of a questionnaire survey. The first two variables were operationalized based on seven dimensions of the academic practice: research, scientific publications, fund raising, teaching, young researcher support, coordination and administration, cooperation with other researcher teams. For domain knowledge, the corresponding items comprised the statement “I have knowledge and skills in the domain X of academic practice” and response options from “totally agree” to “totally disagree”. Similarly, for the items of the scale intensity of participation the participants had to rate the intensity of their professional activity according to the same dimensions (“I have much to do in the domain X of academic practice”, with the same response options). The participants’ contribution to artifact development was measured by rating how far they usually contribute to updating and (re-) designing the web site of the own research and teaching (and respectively administrative) unit. The variable expert status was determined in two steps. First, the participants were given the list of all the persons involved in the study, and were


asked to rate how many common activities they have with each of them. Second, the obtained data served as input for a social network analysis using the software UCINET version 6. The expert identity of each participant was then extracted as degree centrality (in-degree), i.e. the sum of the others' ratings referring to that person. The model variables had medium mean values. All the correlations hypothesized by the research model (fig. 1) were found to be significant, with regression factors ranging from medium to very high. The only exception was the effect of the time in the community on the intensity of participation, which was not significant. Intensity of participation significantly mediated the relationship between domain knowledge and expert status; expert status significantly mediated the relationship between intensity of participation and artifact development. The variance of artifact development could be explained to 23%, the variance of expert identity to 21%, and the variance of intensity of participation to 81%. This study is extensively presented in [3].
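The in-degree computation described above (performed with UCINET in the study) amounts to summing, for each member, the ratings received from all other members. A minimal sketch under that assumption, with an invented 3x3 rating matrix:

# Minimal sketch of the in-degree computation: the expert status of a member is the
# sum of the ratings that all other members gave to him or her. The matrix is
# assumed square, ratings[i][j] = how much person i reports sharing activities with
# person j; the diagonal (self-ratings) is ignored.
import numpy as np

def expert_status(ratings: np.ndarray) -> np.ndarray:
    r = ratings.astype(float).copy()
    np.fill_diagonal(r, 0.0)          # self-ratings do not count
    return r.sum(axis=0)              # column sums = weighted in-degree per person

ratings = np.array([[0, 3, 1],
                    [2, 0, 4],
                    [1, 2, 0]])
print(expert_status(ratings))         # -> [3. 5. 5.]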

3.2 Study 2: A German IT users CoP

The second study was conducted in the same academic community. This time, the community practice in the focus of the study is the use of IT hardware and software for academic purposes. The IT users CoP has a mostly informal character; the relatively small institutionalized part of the community is represented by the IT support group. The artifact development considered was the production of written instructions about the use of computers in the frame of academic work. Such instructions were similar to hardware or software user manuals; however, they were shorter and not necessarily connected to one another, and referred to the technology use that was most frequent in the academic CoP. The studied sample comprised N = 72 participants with non-technical and technical professions: faculty, student assistants, secretaries and technical staff. While the setting and the design of the study were similar to study 1, some of the variables of the research model (fig. 1) were defined and measured in a different way. Regarding domain knowledge, the participants had to self-evaluate their knowledge related to software use (e.g. office software), hardware and software installation, network administration etc. For intensity of participation, the participants rated how often they helped colleagues in activities related to the previously questioned dimensions of the domain knowledge. For expert status, the participants rated how often they helped colleagues from various categories (e.g. secretaries, student assistants etc.). The other variables were measured similarly to study 1. In spite of the somewhat different setting and methodology, study 2 brought roughly the same results as study 1. All the correlations hypothesized by the research model were significant, with regression factors ranging from medium to high. Intensity of participation significantly mediated the relationship between the independent variables (i.e. both domain knowledge and time in the community) and expert status; expert status significantly mediated the relationship between intensity of participation and artifact development. This time, too, the effect of the time in the community on the intensity of participation was not significant. The variance of intensity of participation could be explained to 50%, the variance of expert status to


43%, and the variance of artifact development to 20%. This study is extensively presented in [4].

3.3 Study 3: A Romanian academic CoP

Study 1, carried out in Germany, was replicated at a Romanian university, in the same field of Psychology and Educational Sciences. The sample consisted of N = 67 members of the institution with different expertise levels, belonging to four different researcher teams. The methodology was the same as in study 1. All the hypothesized correlations were found to be significant, with regression factors ranging from medium to very high. Again, the effect of time in the community on the intensity of participation was not significant. Intensity of participation significantly mediated the relationship between domain knowledge and expert status. Unlike in study 1, the effect of expert status on the contribution to artifact development was not significant. Instead, a direct influence of the participation intensity on artifact development was found. The variance of artifact development could be explained to 11%, the variance of expert identity to 22%, and the variance of intensity of participation to 84%. This study is extensively presented in [5].

4 Conclusions

Although different CoPs were studied using different methodologies, the results were quite similar, which confirms the research model deduced from Lave and Wenger’s theory of situated learning [1, 2]. The only part of the research model that was not confirmed by the findings of study 3 was the relationship of artifact development to the other model variables, which needs more attention. This difference may be explained by the differences in culture and organization, or by the different attention paid to web pages in German and Romanian universities. A limitation of the presented three studies consists of the relatively small samples. A total sample size of N = 209 ensures a certain result validity, nevertheless larger samples of CoPs with different practices will provide more robust findings. Another limitation consists of the common methods bias [16], which is likely to have occurred while measuring both domain knowledge and intensity of participation by means of questionnaire survey. Both limitations may be eliminated using automated analysis of the social structures and the collaborative discourse that CoPs build in their practices. Domain knowledge may be indicated, besides self-evaluation and knowledge tests, by the topics involved in CoP members’ discourse, their width and depth, the frequency of misconceptions, the ability of CoP members to bring into the discourse knowledge that is helpful for problem solving etc. Also, expressions of interest for certain topics may give insight in a CoP members’ expertise. Time in the community appears to be rather inconclusive according to the studies presented above; there are however numerous and robust findings of the expertise research that suggest its important role in CoPs, so that it should not be excluded from the research model. Time in community can be easily determined as individual presence in a CoP. On the other hand, in virtual communities only the active participation, i.e. the individual contributions to the community discourse are visible.


Lurking may have positive cognitive effects; nevertheless it may remain invisible for most research instruments. Intensity of participation has been determined in many studies of the last two decades [17] based on various approaches, and with encouraging results. Many of these approaches may be automated, too. Expert status is defined in many recent studies by means of graph theory and social network analysis [13]. This line of research usually focuses on the frequency or intensity of interactions. The combination of social network analysis, content analysis and recommender systems still needs to be explored. Artifact development is probably the CoP dimension with the lowest occurrence in the psychological research literature. The role of cultural artifacts in the social practice and the ways in which they accumulate reified knowledge still need to be conceptualized and thoroughly studied. The automated analysis of these variables has the potential of overcoming several limitations of the current research, including the relatively small samples and the common methods bias. Interdisciplinary research will thus provide valuable support for understanding informal learning and knowledge construction in CoPs.

References

1. Lave, J., & Wenger, E. (1991). Situated learning. Legitimate peripheral participation. Cambridge: University Press.
2. Wenger, E. (1999). Communities of practice. Learning, meaning, and identity. Cambridge, UK: University Press.
3. Nistor, N. & August, A. (2010). Toward a quantitative model of communities of practice: Intrinsic expertise, participation, expert identity and artefact development in an academic community. Paper presented at the EARLI SIG 2+3 Conference "Moving through cultures of learning", Utrecht, September 2-3, 2010.
4. Nistor, N. & Schustek, M. (2011, September, submitted). Verifying a quantitative model of communities of practice in a computer users' community. Paper submitted to the Sixth European Conference on Technology Enhanced Learning (EC-TEL), Palermo, 20-23 September 2011.
5. Nistor, N. & Fischer, F. (in preparation). From expertise to expert identity: Predicting the expert status in communities of practice. Journal of the Learning Sciences.
6. Bereiter, C. (2002). Education and mind in the knowledge age. Mahwah, NJ: Lawrence Erlbaum.
7. Boylan, M. (2010). Ecologies of participation in school classrooms. Teaching and Teacher Education, 26 (1), 61-70.
8. Engeström, Y. & Sannino, A. (2010). Studies of expansive learning: Foundations, findings and future challenges. Educational Research Review, 5 (1), 1-24.
9. Fuller, A., Unwin, L., Felstead, A., Jewson, N. & Kakavelakis, K. (2007). Creating and using knowledge: an analysis of the differentiated nature of workplace learning environments. British Educational Research Journal, 33 (5), 743-759.
10. Paavola, S., Lipponen, L., & Hakkarainen, K. (2004). Models of innovative knowledge communities and three metaphors of learning. Review of Educational Research, 74 (4), 557-576.
11. Ericsson, K. A. (2006a). An introduction to Cambridge Handbook of Expertise and Expert Performance: Its development, organization, and content. In K. A. Ericsson, N. Charness, P. Feltovich & R. R. Hoffman (eds.), Cambridge handbook of expertise and expert performance (pp. 3-20). Cambridge, UK: Cambridge University Press.
12. Ericsson, K. A. (2006). The influence of experience and deliberate practice on the development of superior expert performance. In K. A. Ericsson, N. Charness, P. Feltovich & R. R. Hoffman (eds.), Cambridge handbook of expertise and expert performance (pp. 685-706). Cambridge, UK: Cambridge University Press.
13. Borgatti, S. P., Mehra, A., Brass, D. J. & Labianca, G. (2009). Network analysis in the social sciences. Science, 323, 892-895.
14. Gillespie, A. & Zittoun, T. (2010). Using resources: Conceptualizing the mediation and reflective use of tools and signs. Culture & Psychology, 16 (1), 37-62.
15. Nistor, N. (2010). Knowledge communities in the classroom of the future. In K. Mäkitalo-Siegl, F. Kaplan, J. Zottmann & F. Fischer (Eds.). Classroom of the future. Orchestrating collaborative spaces (pp. 163-180). Rotterdam: Sense.
16. Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y. & Podsakoff, N. P. (2003). Common method bias in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88 (5), 879-903.
17. Nistor, N. & Neubauer, K. (2010). From participation to dropout: Quantitative participation patterns in online university courses. Computers & Education, 55 (2), 663-672.


A Rule-based System for Evaluating Students' Participation to a Forum
Iulia Pașov, Ștefan Trăușan-Matu, Traian Rebedea

University “Politehnica“ of Bucharest, Computer Science Department, 313 Splaiul Independentei, 060042 Bucharest, Romania {iulia.pasov, traian.rebedea, stefan.trausan}@cs.pub.ro

Abstract. Online discussions play an important role in student learning, increasingly becoming part of the education process and not only of distance education. The purpose of this study is to provide a tool used in computer mediated communication and designed to support educational experiences. The application is based on the Community of Inquiry model, which consists of three elements: the cognitive presence, the social presence and the teaching presence. It implements a rule-based algorithm that evaluates participation in forums according to a complex set of patterns. Keywords: Artificial Intelligence, Rule-based System, Forum, Natural Language Processing, Computer Supported Collaborative Learning (CSCL)

1 Introduction

At a time when computer technology is advancing at breakneck speed and software developers claim to have designed applications with artificial intelligence, it is worth going back more than 60 years to Alan Turing's question from "Computing Machinery and Intelligence": "Can machines think?" [1]. As a matter of fact, the question dates back to philosophers such as Plato, Aristotle or even the Sankhya and Yoga schools of Hindu philosophy, who attempted to resolve the mind-body problem, which seeks an explanation of the relationship between minds, or mental processes, and bodily states or processes. After nearly 3,000 years we still cannot give a definitive answer: all we can say is that machines can think in principle, but not in the same way that we do. The computer, as a language machine, uses and manipulates symbols, words, with no regard for their interpretation or meaning. It cannot understand our language; to the extent that relations between meanings can be reflected in precise rules, the computational manipulations make sense, but it is not right to assume that these manipulations have anything to do with meaning. Instead of trusting computers to understand natural language, which is not yet possible, we can see them as a way of organizing, searching and manipulating texts that are created by people, in a context, and ultimately intended for human interpretation [2].

1.1 Forums in Computer Supported Collaborative Learning

By participating in forum discussions, students can interpret and analyze others' writings, reflect on their knowledge and readings, present their point of view, and provide pointers to information that supports their ideas [3]. The evaluation of many forum messages offers the instructor a number of challenges. Since two common goals of discussion forum assignments are to support collaborative learning and to build a community of learners, it is not enough to evaluate each contribution on its own; contributions must be evaluated together, as a whole. Although forums are widely used to support a variety of course requirements, the tutor is faced with the difficulty of interpreting and evaluating the learning and the quality of participation reflected in student contributions. Since a student does not post only to a single topic, but to hundreds of topics of different types over weeks, the view of his or her individual experience is not clear. To support this evaluation, the CoI (Community of Inquiry) model was introduced. This model classifies the students' contributions into up to three categories (cognitive presence, social presence and teaching presence).

2 The Community of Inquiry Model

Interaction is considered essential to learning experiences from the sociocultural constructivist perspective [4], which assumes that members of a community who participate in communication develop more knowledge. The student's potential capacity for intellectual growth is considered to be enhanced by the presence of guidance, represented by a tutor or a colleague, through interaction. In 1989 Michael Moore introduced three types of interaction now widely described and accepted in the field of distance education: learner-content, learner-instructor, and learner-learner interactions [5]. Extending these basic interaction types and aiming to contextualize the resulting interactions, Garrison, Anderson and Archer developed the Community of Inquiry model.

Fig.1 Community of Inquiry: Communication Medium [6]


As observed in Fig. 1, a useful learning experience is embodied in a community of inquiry composed of teachers and students, the key participants in an educational process. The model assumes that learning takes place in the community through the interaction of three essential elements: the social presence, the cognitive presence and the teaching presence. Each of these presences can be identified through certain indicators represented by various phrases, keywords or their synonyms. For a better structure, the indicators were grouped into subcategories that indicate more clearly which aspect of each element they capture.

2.1 Cognitive Presence

The Cognitive presence is a vital element in critical thinking, a process and outcome that is frequently presented as the ostensible goal of all higher education [6].

Table 1. Cognitive Presence Indicators

Triggering Event (Cte): Recognition of a problem, perhaps from experience, expressing surprise or discomfort, questions, demanding explanations; e.g. "I am not quite sure what […] is about – should we start by clarifying what we mean by […]."

Exploration (Ce): Exchange of information, clarification of terms or situations, discussing ambiguities, searching explanations; e.g. "I think this is an interesting topic, but others have a different perception of it, to mine and we have to take this into account."

Integration (Ci): Integration of knowledge and ideas into coherent explanations, testing possible insights about the problems; e.g. "I observed all types of behaviors on the hospital wards, but you have to reflect on the reasons why people behave as they do in order to […] and the easiest course of action is not always the best."

Resolution (Cr): Reflecting on the effectiveness of different solutions or concerns, exploring meanings, agreement or differences; e.g. "Perhaps the best way of approaching this problem is not by […], but by asking questions in such a manner […], which would improve inappropriate practices by […] by making them think about their actions."

2.2 Social Presence

Social presence is defined as the ability of the participants in the Community of Inquiry to project their personal characteristics into the community, presenting themselves to other participants as real people [7].

Table 2. Social Presence Indicators

Emotional Expression (See): Sharing emotions and feelings; expressing both conventional and unconventional emotions, humor, irony, self-revealing; e.g. "I was so angry […] I could not understand him […]."

Open Communication (Soc): Recognition of other participants and their contribution, encouraging them and their messages; e.g. "In your last message you referred to […]. I really liked your interpretation of that situation."

Group Collaboration (Sgc): Encouraging group interactions, accepting differences; indicated by addressing the group with "we"/"us", references to participants using names, etc.; e.g. "I think that John summarized our discussions very well..."

2.3 Teaching Presence

The teaching presence consists of two general functions, which may be performed by any participant in the Community; however, in an educational environment, these functions are likely to be the primary responsibility of tutors [8].

Table 3. Teaching Presence Indicators

Instructional Management (Tim): Facilitates the development of group organizing by initiating discussions, setting rules and labels; e.g. "In our initial face to face meeting we decided to deal with…." "We must finish this discussion by Friday…"

Direct Instruction (Tdi): Understanding when the group has reached a dead end and helping participants to overcome this situation, using knowledge or references from outside the group; e.g. "If you want to upload an attachment just click on…."

Group Cohesion (Tgc): Facilitates group collaboration by identifying agreements or disagreements, providing an ideal environment for discussions, summarizing, using key questions, confirming interesting comments of other participants, encouraging everybody to participate; e.g. "Any thoughts on this issue?" "Anyone care to comment?" "I think we're getting a little off track."

3 Related Work

Due to the need to extract information from various sources, verbal or textual, several applications have been developed, most of them aiming to detect and interpret emotions. An automated system for detecting social, cognitive or teaching presences in forums has not been developed yet. The first application presented below uses a rule-based system for sentiment analysis, while the second uses several indicators to classify forum messages.

3.1 SmartSense

SmartSense is a feature developed by RightNow Web, aiming to identify users' different emotional states or attitudes based on the words and language they use when submitting questions. This feature enables companies to improve services for those customers who are unhappy with their products or services. SmartSense scans the input and assigns an overall emotive index between -3 and 3. It uses fuzzy logic, natural language processing techniques and a list of emotional words to determine the emotive rating of description or solution fields, and rates each with a value [9].
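The exact SmartSense implementation is proprietary; the toy sketch below only illustrates the general idea of a lexicon-based emotive index clipped to the [-3, 3] range described above. The word list and weights are invented for the example.

# Toy illustration of a lexicon-based emotive index (not RightNow's actual
# SmartSense implementation). The lexicon and weights are invented; the final
# score is clipped to the [-3, 3] range.
EMOTIVE_LEXICON = {"angry": -2, "frustrated": -2, "broken": -1,
                   "thanks": 1, "great": 2, "love": 2}

def emotive_index(text: str) -> int:
    score = sum(EMOTIVE_LEXICON.get(w.strip(".,!?").lower(), 0)
                for w in text.split())
    return max(-3, min(3, score))

print(emotive_index("I am so angry, the product arrived broken!"))  # -> -3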

3.2 The DIAS System [10]

The DIAS (Discussion Interaction Analysis System) was mainly developed to offer extended interaction analysis support by providing a wide range of indicators, used in different learning situations, to all discussion forum users (students, teachers, researchers), appropriate for their roles in learning activities. The DIAS system aims to offer direct assistance to students participating in an online discussion (forum), supporting them by making them aware of their own actions and behaviors, as well as of their collaborators'. In parallel, the persons who monitor these discussions (mostly teachers) are helped to identify difficulties during learning situations, so that they can intervene in the communication. The teacher is given the opportunity to choose the appropriate set of indicators (e.g. "collaboration quality", "classification", "activity", "contribution") for each student, depending on their personality or the learning activity they are involved in. In order to detect different learning situations, students' results and performances must be computed. Therefore, both quantitative data (direct scores from assessments) and qualitative data (metadata descriptors of learning objects and the learning process) are used [10].

4 The Application

The application was designed to analyze forum messages and assess each student’s participation in them. It was tested on a corpus of forum messages from students from the School of Medicine of the University of Manchester. The application consists of two modules: the user interface, which should be as simple and intuitive as possible, and the rule-based system designed to identify the defined components of a community of inquiry: the social presence, the cognitive presence and the teaching presence. The rule-based system finds certain patterns in messages, and classifies each message to one or more types of presences. The students in whose messages all three types of presences were indentified, are the ones who participate most to the conversation. The definition of the cognitive, social and teaching presences enables a more complex analysis of students’ participation in conversations, so persons with long but irrelevant contributions to the forum are not considered to participate more than the ones who engage in the conversation from all possible aspects (cognitive,


social and teaching). As shown in Fig. 1, the true educational experience is reached where all the elements are present.

Fig.2. Class Diagram

The dataset is read from an XML file and consists of approximately 400 forum messages from different conversation threads, some annotated with relevant subcategories (cte, ci, ce, cr, see, soc, sgc, tim, tdi, tgc). Among the columns of the table, one can be modified by the user: the one containing annotations read from the dataset. By selecting a message from the table, the user can change the annotations or visualize the entire text of the message, which is not displayed in the initial table. The second tab allows the user to display a selected message’s content and the calculated annotations.
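The paper does not give the exact XML schema of the corpus; the sketch below assumes a hypothetical layout with <message> elements carrying a codes attribute, simply to illustrate how such a dataset can be loaded.

# Sketch of loading the forum corpus, assuming a hypothetical XML layout such as
# <corpus><message thread="t1" codes="cte,soc">text...</message>...</corpus>.
# The actual schema used by the application is not specified in the paper.
import xml.etree.ElementTree as ET

def load_corpus(path: str):
    messages = []
    for msg in ET.parse(path).getroot().iter("message"):
        messages.append({
            "thread": msg.get("thread"),
            "codes": [c for c in (msg.get("codes") or "").split(",") if c],
            "text": (msg.text or "").strip(),
        })
    return messages

for m in load_corpus("forum_messages.xml")[:3]:
    print(m["thread"], m["codes"], m["text"][:60])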


Fig.3. User Interface; Selecting a Message

The application annotates each sentence with as many types of subsets from the categories defined in the Community of Inquiry model, but only one type of cognitive presence. This is possible due to complex search methods designed to find specific patterns consisting of not only certain words but also their synonyms or parts of speech. For a better work with words and sentences, since it is based on pattern recognition, the application uses WordNet. WordNet is a lexical database for the English language, developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as literally appearing in texts and the actual senses of the words. A set of words that share one sense is called a synset. Thus, each synset identifies one sense. Words with multiple meanings belong to multiple synsets. WordNet provides relations between synsets such as hyponymy or homonymy. Conceptually, the hyponymy relation in WordNet spans a directed acyclic graph (DAG) with a single root node called entity. Semantic similarity based on WordNet has been widely explored in Natural Language Processing and Information Retrieval. These methods can be classified into three categories [11]:

• Edge-based methods: to measure the semantic similarity between two words is to measure the distance (the path linking) of the words and the position of the word in the taxonomy.
• Information-based statistics methods
• Hybrid methods: combine the above methods
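The application itself works through a Java WordNet interface; purely as an illustration of the notions above (synsets, hyponymy, edge-based similarity), the same queries can be made with NLTK's WordNet module.

# Illustration of the WordNet notions mentioned above using NLTK's WordNet API
# (only an equivalent demonstration, not the application's own library).
# Requires: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.definition())                 # gloss of the synset
print(dog.hypernyms())                  # edges in the hyponymy DAG
print(dog.path_similarity(cat))         # edge-based similarity in [0, 1]
print([l.name() for l in wn.synset("bank.n.01").lemmas()])  # words sharing one sense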


Fig.4. User Interface; Highlighted Sentence with Cognitive Presence

4.1

The rules

The rule-based system classifies each sentence into one subclass of the social, cognitive or teaching presences, and therefore into the corresponding main class, according to rules of the following form:

IF pattern_match_to_subclass() THEN assign_to subclass

meaning that if the sentence contains a part that matches a given pattern, then the sentence is assigned to the subclass the pattern represents.
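A minimal sketch of this rule form, with regular expressions standing in for the richer patterns of section 4.2 (the rule set below is invented for illustration):

# Minimal sketch of the rule form given above: if a sentence matches the pattern of
# a subclass, the sentence is assigned to that subclass (and thus to its presence).
import re

RULES = [                               # (subclass code, simplified pattern)
    ("soc", re.compile(r"\bin your last message\b", re.I)),
    ("cte", re.compile(r"\bI am not (quite )?sure\b", re.I)),
    ("tgc", re.compile(r"\bany thoughts on\b", re.I)),
]

def classify_sentence(sentence: str):
    return {code for code, pattern in RULES if pattern.search(sentence)}

print(classify_sentence("Any thoughts on this issue? I am not sure it applies."))
# -> {'tgc', 'cte'}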

4.2 The Patterns

The patterns used to assign sentences to classes were created according to the descriptions from sections 2.1, 2.2 and 2.3. A pattern is formed mainly of words, or parts of speech for desired words, synonyms of certain words, etc. Therefore, WordNet is required to identify such patterns. Those patterns can consist of alphanumeric characters and place holders for one or more unknown words. Such patterns are:

1. Salutation: "hello" | "hi" | good (morning | evening | afternoon | bye) | "bye"
2. Express opinion: I ( | ) (agree | feel | believe | think | reckon) | "in my opinion"
3. The pattern matches on text that contains "I" followed by a verb or determinant, then by one of the listed verbs, or text that contains the string "in my opinion". The pattern can have the following form as well:
4. I ( | ) ( | "agree") | …
5. Ask about opinion: What * you * think *?
6. Etc.

Where: matches the desired part of speech, matches the words having the particular stem, * stands for any word, marks any synonym of


the particular word, | means "or" and tests whether any of the expressions can be found in the text, and & means "and" and verifies whether all expressions match the text.
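As an illustration of how such patterns can be matched, the sketch below compiles a simplified pattern language into regular expressions. The syn(word) place-holder notation is ours (the paper's original place-holder symbols were lost in typesetting), and WordNet synonyms are taken from NLTK.

# Sketch of compiling simple patterns of the kind listed above into regular
# expressions: "*" stands for any word and "syn(word)" is an illustrative notation
# for "any WordNet synonym of word". Requires: nltk.download("wordnet").
import re
from nltk.corpus import wordnet as wn

def synonyms(word: str):
    return {l.name().replace("_", " ") for s in wn.synsets(word) for l in s.lemmas()}

def compile_pattern(pattern: str) -> re.Pattern:
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"\w+")
        elif token.startswith("syn(") and token.endswith(")"):
            words = synonyms(token[4:-1]) or {token[4:-1]}
            parts.append("(?:" + "|".join(re.escape(w) for w in sorted(words)) + ")")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.I)

ask_opinion = compile_pattern("what * you * syn(think) *")
print(bool(ask_opinion.search("What do you all think about this?")))  # -> True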

Fig.5. User interface. Selected Cognitive Presence

The application is given a set of patterns for each subtype of presence and assigns the subtype to the sentences that match the pattern. Therefore, in the end, the most relevant sentences are annotated. A forum message is said to correspond to one type of presence if it contains at least one sentence annotated with one subtype of this presence. A true educational experience is considered to be achieved in the messages that were annotated with all three presences: social, cognitive and teaching.
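A minimal sketch of this aggregation rule, using the subtype codes of section 2:

# Sketch of the aggregation rule described above: a message shows a presence if at
# least one of its sentences is annotated with a subtype of it; a "true educational
# experience" message shows all three presences.
PRESENCE_OF = {"cte": "cognitive", "ce": "cognitive", "ci": "cognitive", "cr": "cognitive",
               "see": "social", "soc": "social", "sgc": "social",
               "tim": "teaching", "tdi": "teaching", "tgc": "teaching"}

def message_presences(sentence_codes):
    """sentence_codes: list of code sets, one set per sentence of the message."""
    return {PRESENCE_OF[c] for codes in sentence_codes for c in codes}

annotated = [{"soc"}, {"ci", "tgc"}, set()]
presences = message_presences(annotated)
print(presences, "educational experience:",
      presences == {"cognitive", "social", "teaching"})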

5 Results

The application was tested on 300 forum messages from students of the School of Medicine of the University of Manchester. They discussed topics like patient safety, the role of medical students in hospitals, relations with doctors, nurses and patients, etc. The messages were all annotated with subtypes of the presences from section 2 before the tests. The example in Fig. 6 shows a message with sentences that matched all types of presences. The person expresses his opinion and observations, also encouraging a further discussion on a related topic, proving that he is fully participating in the conversation.


Fig.6. Educational Experience

5.1 Accuracy

The application was tested with two different definitions of correctness. First, a message was considered correctly annotated if all the subcategories present in the experts' annotations were identified in the exact same number; the differences are due to the fact that some experts annotated paragraphs and not sentences, as the application does. Secondly, only the type of presence (or sub-presence) was considered, and the number of identifications was not taken into account. The results of the second test were close to 90% precision because of the complex patterns defined. The smaller number of correctly annotated messages for the teaching presence is due to ambiguous patterns.

(""# '"# &"# 812:# (#

%"# $"# "# )*+,-.#/01213+1#

4*53,671# /01213+1#

81-+9,35# /01213+1#

Fig.7. Test Results
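The two correctness definitions of section 5.1 can be expressed as a strict comparison of code multisets versus a relaxed comparison of code sets; the following sketch, with invented annotations, illustrates the difference:

# Sketch of the two correctness definitions: "strict" requires the same subcategory
# codes with the same counts as the expert annotation, "relaxed" only the same set
# of codes. Message annotations are lists of codes such as "cte" or "soc".
from collections import Counter

def accuracy(gold, system, strict=True):
    correct = 0
    for g, s in zip(gold, system):
        if strict:
            correct += int(Counter(g) == Counter(s))
        else:
            correct += int(set(g) == set(s))
    return correct / len(gold)

gold   = [["cte", "soc"], ["ci"], ["tgc", "tgc"]]
system = [["soc", "cte"], ["ci"], ["tgc"]]
print(accuracy(gold, system, strict=True))    # -> 0.666...
print(accuracy(gold, system, strict=False))   # -> 1.0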

The application can classify any dataset of forum messages of any types since neither the patterns, nor the analysis depend on the messages’ content. Moreover, the application should only run once for each dataset since the results are constant.

5.2 Efficiency

For a better analysis of the situation and of the efficiency of the application, several medical students tried to annotate the entire corpus. Having a description of each type and subtype of presence, they were required to annotate each sentence (if possible). A normal forum message of not more than 150 words was annotated in about 5 minutes per person during the first 30 minutes of the analysis, but the duration increased over time, since the human annotators began to get tired, and the accuracy of the annotations decreased accordingly. Moreover, after a couple of hours they asked for a break, so the entire process lasted two days. The application, however, annotates any message in seconds (or milliseconds for some examples) and never gets tired, regardless of the situation. Therefore, the entire analysis is done in no more than four hours, and the computer is ready to start over.

6 Conclusion

The results obtained from the system allow us to conclude that the evaluation of students' participation in a forum discussion can be achieved automatically. Also, besides being as accurate as human analysis, it is much more time-efficient. Moreover, the messages annotated with all three categories – social, cognitive and teaching – do mark the students' contributions to the forum in which a true learning experience takes place, because those students participate the most in the communication, not only by the number or length of their messages. A person cannot continuously annotate more than fifteen messages, and after a few hours of work his or her performance decreases. Unlike the manual evaluation, which is always somewhat subjective, the automatic one never gets tired, no matter how many messages are annotated. The presented application can be improved with more advanced patterns, but too many patterns could increase the level of ambiguity in classifying the information. Furthermore, a set of patterns can only be applied to specific domains and languages, which is why the solution cannot be applied to a more general problem. It is recommended that a set of patterns be as minimal as possible, since the analysis of messages takes a lot of time, and a multilingual set of patterns can only aggravate the problem. Moreover, patterns might tend to overfit on the test data. Therefore, the application can be improved if combined with machine learning algorithms that automatically generate patterns for more accurate results.

References

1. Turing, A. M.: Computing machinery and intelligence. Mind, 59, p. 433-460 (1950)
2. Winograd, T.: Thinking machines: Can there be? Are we? Report No. STAN-CS-87-1161, Stanford (1987); reprinted in Chrisley, R. (ed.): Artificial Intelligence: Critical Concepts, vol. 3, p. 167-189. London: Routledge (2000)
3. Akers, R.: Web Discussion Forums in Teaching and Learning. Case Study, University of North Carolina (1997)
4. Vygotsky, L.: Thought and Language. Cambridge: MIT Press (1962)
5. Moore, M.: Three types of interaction. The American Journal of Distance Education, 3(2) (1989)
6. Garrison, D. R., Anderson, T., Archer, W.: Critical inquiry in a text-based environment: Computer conferencing in higher education. Internet and Higher Education, 2(2-3), p. 87-105 (2000)
7. Garrison, D. R., Anderson, T.: E-Learning in the 21st Century: A Framework for Research and Practice. London: RoutledgeFalmer (2003)
8. Anderson, T., Rourke, L., Garrison, D. R., Archer, W.: Assessing teaching presence in a computer conferencing environment. Journal of Asynchronous Learning Networks, 5(2) (2001)
9. RightNow Web: Administration Manual v.4.0 (2000)
10. Bratitsis, T., Dimitracopoulou, A.: Monitoring and Analyzing Group Interactions in asynchronous discussions with the DIAS system. In: 12th International Workshop on Groupware, CRIWG 2006, GroupWare: Design, Implementation and Use, p. 54-61 (2006)
11. Gangemi, A., Navigli, R., Velardi, P.: The OntoWordNet Project: Extension and Axiomatization of Conceptual Relations in WordNet (2003), http://www.loacnr.it/Papers/ODBASE-WORDNET


A Knowledge-based Approach to Entity Identification and Classification in Natural Language Text Stefan Dumitrescu1, Stefan Trausan-Matu1, Mihaela Brut2, Florence Sedes2 1

Politehnica University of Bucharest, Computer Science Department 313 Splaiul Independentei, Bucharest, Romania 2 IRIT – Research Institut in Computer Science of Toulouse, 118 Route de Narbonne, 31062 Toulouse, France [email protected], [email protected], [email protected], [email protected]

Abstract. We present a system for entity identification in free text. Each extracted entity will be identified as an entity from the system's knowledge base – an ontology. We propose an algorithm that detects in a single pass the most probable related entity assignations instead of individually checking every possible entity combination. This unsupervised system employs graph algorithms applied on a graph extracted from the ontology, as well as an entity path scoring function, to provide the most probable related entity assignations. We present the system's implementation and results, as well as its strong and weak points. Keywords: entity identification, unsupervised, knowledge based, graph algorithm, ontology

1 Introduction

The Web is currently the most used information source world-wide. New content is added every day, in ever increasing amounts. However, the vast majority of this content is added in an unstructured manner. Current search engines build ever larger indexes of websites to allow access to this content. But current Information Retrieval methods can only go so far, and new methods to quickly obtain required information are more and more desired. The Semantic Web promises relevant information delivered fast, and in the format the user desires. This means that computers need to ‘understand’ to some degree the information they store and process. The field of Information Extraction (IE) takes on the task of extracting information from existing sources, be they unstructured (free text, books, news articles), semi-structured (XML, structured web pages like Wikipedia) or structured sources (databases) and then translating this information in a computer understandable structured format that the machine can process. One basic form to store this information so that it can be easily processed by the computer is in the form of simple entity-relation tuples (subjectpredicate-object). As such, some of main tasks in IE are entity and relation detection and identification.


This paper presents one approach to a sub-task of IE, namely identification and correct assignation of predefined classes to entities found in free text. We present a fully unsupervised, knowledge-rich system that, given natural text as input will extract relevant entities from it (both common and named entities) and will assign to each extracted entity a class found in an ontology. From a certain point of view it can be viewed as a fine-grained, partially targeted word sense disambiguation (WSD) problem [1], or even partially as the WSD sub-problem of word sense discrimination [2]. The entire system is driven by the idea that entities are defined by their context. A single entity can mean multiple things, but when put in context its meaning becomes clear. Context in this case means some form of directional logical link from an entity to another, as each entity specifies every other to some degree. For example, when saying “the bank in Paris”, bank defines Paris, and Paris defines bank. Individually, “Paris” could mean the capital of France, the singer or even the historic figure, and “bank” could mean a monetary institution, a school of fish, a flight maneuver or the side of a river. Put together, their meaning becomes clearer. Paris can no longer be a person, and bank can no longer be a flight maneuver. Adding another entity as “accounts opened at the bank in Paris” will then clearly specify every entity, including bank which represents a monetary institution and not the bank of a river possibly named Paris, even without looking at the words linking them such as verbs, prepositions, etc. We rely on the fact that in order to connect entities together, this information has to already exist in some form of knowledge repository. Ontologies match our desired repository structure, as an ontology is at its core a type of graph that interconnects entities. In the system presented in this paper we will use a generic, large scale ontology that encompasses both named and common entities. 2

Problem definition and proposed solution

Given free text in the form of sentences written in natural language, we aim to detect relevant entities and then identify them to matching classes in an ontology. For example, for the sentence “Einstein’s theories are discussed by Kaku in his latest book.” we would like to detect that Einstein is an entity and that it refers to Albert Einstein, Kaku is also an entity and it refers to physicist Michio Kaku, and also that the common nouns “theories” and “book” are identified as a scientific theory or at least a general theory and a literature book respectively. We will attempt to do so using graph algorithms applied on an ontology which represents our knowledge source. In this article we shall call the entities extracted from sentences “string entities” because they are just that, bounded sequences of characters without meaning for a computer, and entities from the ontology as “canonic entities” as they are clearly defined entities linked between them to create useful knowledge structures. As such, we present a system that, given a document, will identify and extract entities and associate a canonic entity from the ontology to each string entity extracted from the text. Let us consider a text document D composed of sentences written in natural language. We first identify and extract string entities SEk from the sentences


composing D. In other words, we can see D as the set of the n string entities found within:

D = {SEk | k = 1, ..., n}     (1)

where n is the number of string entities found in D. Next, we assign to each string entity SEk a list of canonic entities CE that each could actually represent SEk. For example we assign to string entity SE1 "Einstein" canonic entity CE1 Albert_Einstein, but also CE2 Hermann_Einstein, his father. Both are valid possibilities for "Einstein" as a first name is not specified. We define PCESEk the set of probable canonic entities CE assigned to a string entity SEk:

PCESEk = {CEj | j = 1, ..., mk}     (2)

where mk is the number of canonic entities identified for string entity SEk. For example, PCESE1 from the above example would be: PCESE1 = {CE1, CE2}. After processing, the algorithm will output an array of result items RIi:

RIi = {CEk | CEk ∈ PCESEk or CEk = unknown, k = 1, ..., n}     (3)

meaning that each result item RIi contains exactly n canonic entities CEk, each one belonging either to its probable canonic entity set PCESEk assigned to string entity SEk or being unknown (defined as “! ”). An unknown value for a string entity can be interpreted either as any entity in its PCESEk, or a new entity as we allow for the possibility of new, unknown entities. To clarify how the algorithm’s results might look like, we take the following basic example: Document D is composed of only one sentence “Einstein visited Ulm.”. We extract two string entities SE1 = “Einstein” and SE2 = “Ulm”. For SE1 we find probable canonic entities set PCESE1 = {Albert_Einstein, Hermann_Einstein} that could represent SE1, and for SE2 we find PCESE2 = {Ulm, Ulm_Montana}. We expect the algorithm to output the best guess given the information available of what string entities “Einstein” and “Ulm” each stand for out of their respective probable canonic entities set. One result item that will score the very high will be RI1 = {Albert_Einstein, Ulm} because Albert Einstein was actually born in Ulm in Germany. However, a plausible result item can also be RI2 = {Herman_Einstein, Ulm_Montana} because we have no information (in YAGO ontology, our current knowledge source) whether or not Einstein’s father ever reached the city of Ulm in Montana, USA. Another plausible result item with the same score as RI2 might be RI3 = {unknown_entity, Ulm_Montana} in case we are actually speaking about a person named Einstein who actually visited the Ulm in Montana and should be treated as a new entity for which we have no information. We allow for unknown entities because otherwise we severely restrict the system to detect only known entities on which we already have information about. Having defined our task, we will sketch out the main idea of the system before presenting it in detail in section 3. The input of our system is a natural language document. In the system’s first module, we take the document, split it into sentences and further into individual tokens, we assign part-of-speech tags to each token, and dependency trees to each sentence. This step also identifies our string entities in the form of common and


proper nouns. All this information goes into the second module for string entity processing. Here two things happen. First, we assign to each string entity a set of probable canonic entities. These canonic entities are extracted from our knowledge source. Thus, at this stage, each string entity could be represented by any one of the canonic entities found its assigned set. Secondly, based on each sentence’s dependency tree, an influence matrix is created. This non-symmetrical matrix contains real values quantifying the influence of each string entity on every other. All the string entities with their assigned canonic entity sets and the computed influence matrix enter in the third and final processing module. Before actual processing, a graph is being created containing all the entities found in all the probable canonic entity sets, as well as every neighbor they have, up to a certain distance. This is basically a sub-graph extracted from the ontology. Then, a set of paths between canonic entities is determined based on this extracted graph. Each path between two entities is given a score, influenced by the links forming the path and the influence matrix. The final step involves finding the highest scoring and largest set of entities linked by a series of such paths. The output is actually a canonic class for every string entity identified. One useful feature of the system is that all results come with a semantic justification graph composed of the path between entities (including relation types and intermediate entities). This justification graph can be used for further result evaluation, manual or automatic, similar to [5][6]. Our purpose could be interpreted as a fine-grained all-word disambiguation (WSD) problem, similar to some of the tasks presented in past MUC/SensEval1 challenges. From a certain point of view, we aim at exactly that: given a text and a knowledge source, assign to each word a class from the knowledge source. However, there are differences, for example while we look at both named entities and common entities (closer to targeted WSD), we do not take into account verbs or other modifiers, and focus only on nouns (named or common). Also, we use a generic ontology that contains millions of possible entities to choose from instead of a small, restricted set. Furthermore, the aim of the system is for its results to be further used in conjunction with other methods or systems (ex: relation detection, machine learning methods) to provide, for example, full ontological facts. This is even more relevant as we allow for unknown entities to exist in valid result items, thus allowing them to be used in new extracted facts to gather information about previously unknown entities. We shortly review some of the closest methods and techniques our system is related to in the area of knowledge-based methods for WSD [2, 3]. One approach is to determine the overlap of sense definitions. Also known as the Lesk algorithm [9], the similarity between a pair of words is calculated as the highest overlap of words definitions. In some sense it is related to our algorithm problem, as the number of words increases linearly, the computational problem increases exponentially. Another approach is by selectional preferences. These are constraints on the type of words that can stand next to another. Word to word measures are computed using frequency count on a corpus or other methods[10]. 
For other, more interesting word to class or class to class (a problem we actually face when evaluating results), large corpora are parsed and frequency together with words' semantic classes

http://www-nlpir.nist.gov/related_projects/muc , http://www.senseval.org


provide a way to select a preferred class or word. This approach however yields poorer results than Lesk's algorithm [11], but is interesting for its class-to-class selection feature that could be applied as a final result selector for our system. Another category is that of structural approaches, divided into similarity-based and graph-based methods. Similarity-based methods assign a score to pairs of words based on the structure of the graph, for example by measuring WordNet hypernym edge distances between words (not much unlike our own scoring method) [12]. Many other metrics have been proposed, including distance-based and information-content-based metrics. The second category, graph-based methods, exploits the structure of the graph itself. However, most of these approaches [14, 15] focus on lexical chains (structures of semantically related words), an approach different from ours.

3 Implementation

We will present the system architecture in a top-down manner. As such, we have the following top-level modules and direction of information flow: the input document (a sentence set) passes through the NLP pipe, then through String Entity processing and Algorithm processing, and results in the output canonic entity list; the Ontology supports the processing modules.

Fig.1. The system’s major modules

The input of the system is a set of sentences (we will call this set of sentences a document). The sentences go into the NLP pipe where they are preprocessed; the extracted string entity sets are then processed and potential canonic entities are attributed to each string entity; finally, our algorithm is applied to them. The output is an array of sets of canonic entities, where each canonic entity corresponds to one string entity found in each sentence. We further present each module in more detail, starting with the ontology.

3.1 Ontology

For the purpose of this paper we will use the YAGO ontology [13]. YAGO is a general, light-weight and extensible ontology with high coverage and quality. YAGO was built as a very large, accurate (95%+ accuracy) and machine-friendly ontology, including the WordNet entities and hierarchy as well as information extracted from Wikipedia, such as named entities (people, organizations, geographic locations, books, songs, products, etc.) and relations among these entities. In the YAGO model all objects are entities and any two such entities can stand in a relation. YAGO uses two sources of information: WordNet and Wikipedia. From WordNet it borrows the hypernym hierarchy, while from Wikipedia it borrows entities and uses them as arguments to the relations implemented in YAGO. Thus, WordNet defines the upper


hierarchy, while Wikipedia contributes the lower, most specific branches. In summary, YAGO stores more than 2 million entities and 20 million facts about them.

3.2 NLP Pipe Module

The NLP pipe is the standard treatment applied to sentences. First, the document is split into sentences by OpenNLP’s Maximum Entropy (ME) based sentence splitter2. This ensures that sentences like “The book is written by J.R.R. Tolkien.” are correctly identified and not split at every period or other punctuation mark. Each sentence is then split into tokens by OpenNLP’s tokenizer (also based on an ME model). Next, each token is further processed: each token contains the original string, its stem (for verbs), its inflected form (for nouns), and a flag indicating whether it is a common word, a nominal noun or a punctuation mark. Also, the sentence is parsed using the Stanford parser, and we store both the syntactic tree and the dependency tree. Multi-token string entities are found using OpenNLP’s Entity Detector (another ME model detecting persons, locations and organizations). This allows us to join neighboring proper nouns into a single string entity. All numbers followed by units of measure are also automatically joined into a single string entity.
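To make the pipe concrete, the following is a minimal illustrative sketch of such an OpenNLP-based preprocessing step. The model file names (en-sent.bin, en-token.bin, en-ner-person.bin) and the OpenNLP version are assumptions, and the Stanford dependency parsing step is omitted; this is not the paper's actual code.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class NlpPipeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical model files; the paper does not list the models it uses.
            try (InputStream sentIn = new FileInputStream("en-sent.bin");
                 InputStream tokIn = new FileInputStream("en-token.bin");
                 InputStream nerIn = new FileInputStream("en-ner-person.bin")) {

                SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                NameFinderME personFinder = new NameFinderME(new TokenNameFinderModel(nerIn));

                String document = "The book is written by J.R.R. Tolkien. "
                        + "The new Hyundai Accent is equipped with a 1.6 liter engine delivering 110 hp.";

                // 1. Sentence splitting (ME-based, robust to abbreviations such as "J.R.R.").
                for (String sentence : sentenceDetector.sentDetect(document)) {
                    // 2. Tokenization (also ME-based).
                    String[] tokens = tokenizer.tokenize(sentence);
                    // 3. Named entity detection; spans of neighboring proper nouns
                    //    can then be joined into a single string entity.
                    Span[] personSpans = personFinder.find(tokens);
                    System.out.println(sentence + " -> " + personSpans.length + " person span(s)");
                }
            }
        }
    }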

3.3 String Entity Processing Module

This module takes as input the processed sentences from the NLP pipe. It performs two relatively distinct operations. First, for each string entity it retrieves a set of appropriate canonic entities from the ontology. Secondly, it computes an influence matrix that measures the influence of each string entity on the others, based on each sentence’s dependency tree.

3.3.1 Canonic Entity Assigning

Each sentence comes out of the NLP pipe with a set of identified string entities. For each of these entities we must assign a set of possible canonic entities that the string entity might represent. We obtain these canonic entities from the ontology we use. Using YAGO’s means relation, we can link strings to ontology classes (canonic entities). For example, for the string entity “engine” we ask YAGO to give us all the entities that contain the substring “engine”. In this particular case, YAGO returns 2259 facts, in the form presented in Table 1. We search for substrings because each ontology class has at least one means relation, meaning that there can be several ways to refer to a single class. For example, the canonic entity “United_States” is referenced by “U.S”, “Stati Uniti d'America“ or “American Civilization“ (to name a few) through the means relation. The reverse also holds: the same word can refer to several YAGO classes.

2 OpenNLP Tools found at http://sourceforge.net/projects/opennlp
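As a rough illustration of the canonic entity lookup just described, the sketch below issues a substring query over the means relation, assuming YAGO is loaded into PostgreSQL (as the evaluation section notes) in a hypothetical facts(subject, predicate, object) table; the actual schema and connection parameters are not given in the paper.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class MeansLookupSketch {
        // Returns the candidate canonic entities (YAGO classes) whose "means" string
        // contains the given string entity as a substring.
        public static List<String> candidateCanonicEntities(Connection db, String stringEntity) throws Exception {
            // Hypothetical table/column names: facts(subject, predicate, object).
            String sql = "SELECT object FROM facts WHERE predicate = 'means' AND subject LIKE ?";
            List<String> candidates = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, "%" + stringEntity + "%");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        candidates.add(rs.getString(1)); // e.g. wordnet_engine_103287733
                    }
                }
            }
            return candidates;
        }

        public static void main(String[] args) throws Exception {
            // Connection URL and credentials are placeholders.
            try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/yago", "user", "pass")) {
                System.out.println(candidateCanonicEntities(db, "engine").size() + " candidate classes");
            }
        }
    }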

Table 1. Example of classes YAGO returns for the query about “engine”

Subject (string)              Relation   Object (YAGO class)
"Alfa Romeo Flat-4 engine"    means      Alfa_Romeo_Flat-4_engine
"automobile engine"           means      wordnet_automobile_engine_102761557
"diesel engine"               means      wordnet_diesel_103193107
"engine"                      means      wordnet_engine_103287733
"engine"                      means      wordnet_engine_103288003
"engine"                      means      wordnet_engine_111417561
"engine"                      means      wordnet_locomotive_103684823

The word “engine” can mean either a motor engine or a locomotive with equal probability. Searching for substrings means we do not miss any probable canonic entity. Having obtained for each string entity a set of possible canonic entities, we now perform a cleaning step for each string entity to reduce this set from possible to probable canonic entities. Using custom heuristics, such as requiring that all the individual words of the string entity are found in the canonic entity’s name, we remove a large number of false positives. Common words can only be represented by WordNet classes, and named entities only by lower-level, non-WordNet classes. For the above example, we drop the canonic class Alfa_Romeo_Flat-4_engine because it is an individual and not a generic entity. Even after this cleaning step, we are usually still left with many probable classes. For example, for the string entity “California” we clean 90% of the more than 12,000 possible classes and we are still left with around 1000 probable classes that each could be representative.

3.3.2 Influence Matrix Computing

The NLP pipe provides a set of string entities but it does not provide a measure of the influence of one entity on another, which we need to take into account in the processing module. Thus, given the set of string entities and a sentence’s dependency tree, we need to compute a matrix of string entity – string entity influence. For example, for the sentence “The new Hyundai Accent is equipped with a 1.6 liter engine delivering 110 hp.” the NLP pipe provides us with the string entities “Hyundai Accent”, “1.6 liter”, “engine” and “110 hp”. For each string entity we traverse the tree looking for connections to other string entities. For our example sentence, we discover the following:

computeInfluence for: Hyundai Accent-4 - 1.6 liter-10   Value: 1.0   subject relation but with proxy engine-11
computeInfluence for: Hyundai Accent-4 - engine-11      Value: 1.0   subject relation
computeInfluence for: engine-11 - 1.6 liter-10          Value: 1.0   direct link for nn(engine-11, liter-10)
computeInfluence for: engine-11 - 110 hp-14             Value: 0.5   proxy for engine-11 - 110 hp-14 using delivering-12


We consider for each string entity a head word to use as a match in the dependency tree (for multi-word tokens like “Hyundai Accent” we will use Accent-4 as the


representative for our string entity, and hp-14 for “110 hp” – the number after the dash is the word’s position in the sentence, as words can be repeated in the same sentence with different meanings and influences). We can assign three influence values between entities: 1.0 means a strong connection, 0.5 means somewhat connected, and 0.1 means no link, but used for context. We obtain the following matrix:

Influence Matrix    Hyundai Accent   1.6 liter   engine   110 hp
Hyundai Accent           ---            1.0        1.0      0.1
1.6 liter                0.1            ---        0.1      0.1
engine                   0.1            1.0        ---      0.5
110 hp                   0.1            0.1        0.1      ---
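A minimal sketch of how the influence values from the example above could be assembled into such a matrix. The actual rules for walking the dependency tree are heuristic and not fully specified in the paper, and the diagonal value of 0.0 is our assumption.

    import java.util.Arrays;

    public class InfluenceMatrixSketch {
        public static void main(String[] args) {
            String[] entities = {"Hyundai Accent", "1.6 liter", "engine", "110 hp"};
            int n = entities.length;

            // Start from the weakest value: no direct link, but usable as context.
            double[][] influence = new double[n][n];
            for (double[] row : influence) Arrays.fill(row, 0.1);

            // Links discovered while traversing the dependency tree (see the trace above):
            influence[0][1] = 1.0; // Hyundai Accent -> 1.6 liter (subject relation via proxy engine)
            influence[0][2] = 1.0; // Hyundai Accent -> engine    (subject relation)
            influence[2][1] = 1.0; // engine -> 1.6 liter         (direct nn link)
            influence[2][3] = 0.5; // engine -> 110 hp            (proxy through "delivering")

            for (int i = 0; i < n; i++) {
                influence[i][i] = 0.0; // diagonal: an entity does not influence itself (assumed)
                System.out.printf("%-14s %s%n", entities[i], Arrays.toString(influence[i]));
            }
        }
    }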

A note worth mentioning is that the system’s performance is strongly tied to this influence matrix. At times, the dependency tree fails to capture the correct dependencies, and the final results will then be rather poor due to missed links between entities. However, we assume that the parser we use (the Stanford parser) is among the best currently available. Dependency tree generation, like syntactic tree generation, is a hard problem in itself.

3.4 Algorithm Processing Module

The String Entity Processing module provides sentences with delimited string entities, each entity having an associated set of probable canonic entities, as well as an influence matrix between the string entities themselves. Based on this data, the Algorithm Processing module will provide sets of canonic entities sorted by descending score. However, two steps need to be performed before applying our algorithm. First, we need to create the graph on which to run the algorithm, and then we need to split the entire string entity array into smaller sets that can and should be processed independently. Next, we apply the algorithm to each Process Set. The first step is to identify relevant links in the graph involving the entities of the current Process Set, and then to extract from these links the best scoring result items.

3.4.1 Graph initialization

The graph is the structure on which the algorithm will be run, so it is created first. The graph contains all the probable canonic entities of every string entity, as well as YAGO’s top-level WordNet hypernym tree. In essence, the graph created here is a sub-graph of YAGO itself. We need to create this graph rather than use YAGO directly for the following reasons. First, the YAGO ontology is too large to store in memory on our currently available machines. Secondly, we prune unneeded links from the graph itself. Not only does this increase performance, because the search space is drastically reduced, but it is needed because of the way YAGO stores facts about entities. The nature of the proposed algorithm requires that we use


classes that are most specific, so many links that represent more general relations are dropped. The first step in graph creation is the addition of YAGO’s entire top-level hypernym tree. This generates a graph that contains around 65,000 entities and 73,000 relations between them. The second step is to add all the canonic entities assigned to every string entity that is a named entity. String entities that are common words have already been included in the first step, as they can only have WordNet entities. So, for each named string entity, every canonic entity is added to the graph. However, we add not only the canonic entity itself, but also other canonic entities from YAGO that have connections to the original entity. We find these related entities and links by performing a breadth-first (BF) search (limited to depth 3) on the ontology, starting from the initial entity. Thus, for each canonic entity belonging to a string entity we add a sub-graph starting from that canonic entity. We link each entity to its most specific WordNet class. After these two steps a graph is created, containing all our canonic entities and the possible entities that may link them, as well as the complete WordNet tree, to which every entity must have at least one link. We define this graph as G = <CEset, Lset>, where CEset is the set of canonic entities in G and Lset is the set of links between the canonic entities. G is a directed graph.

3.4.2 Process Set Creation

Process Set creation is a needed step before the algorithm itself is run. A document can contain many sentences, and each sentence can contain many string entities. For example, a normal Wikipedia page can have around 500 identified string entities and a longer page even more than 2000. The proposed algorithm provides solution items that are as long as the original string entity set. For this reason, if we were to give the algorithm an input set of 2000 entities, processing would take very long and most of it would be wasted on entities that are not connected (influence 0.0 in the influence matrix). Thus, we need to find the smallest independent sets of string entities on which to apply the algorithm sequentially (or even in parallel). This is achieved by performing a flood-fill on the influence matrix. First we create a copy of the influence matrix in which every non-zero value is replaced with 1.0. We find the first non-zero element Inf_ij (which in the matrix means that the entity in row i influences the entity in column j) and start zeroing any element that it influences or is influenced by, while in the meantime adding these elements to a new Process Set. The flood-fill is a breadth-first graph search on the matrix. We repeat the entire process until the entire matrix is zeroed out and we have obtained all the independent sets of string entities (a short sketch of this grouping is given below).

3.4.3 Algorithm Processing

This sub-module is applied to each Process Set independently. The input here is a Process Set containing string entities, each with an associated list of probable canonic entities, and the custom sub-graph created from the YAGO ontology on which to run the algorithm.
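Referring back to the Process Set creation step (Section 3.4.2), the following is a minimal sketch of the flood-fill grouping: a breadth-first search over the influence matrix that groups mutually connected string entities. Indices stand for string entities; the helper names are ours, not the paper’s.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class ProcessSetSketch {
        // Groups string entities (by index) into independent Process Sets: two entities
        // end up in the same set if one influences the other, directly or transitively.
        public static List<List<Integer>> buildProcessSets(double[][] influence) {
            int n = influence.length;
            boolean[] assigned = new boolean[n];
            List<List<Integer>> processSets = new ArrayList<>();

            for (int start = 0; start < n; start++) {
                if (assigned[start]) continue;
                List<Integer> set = new ArrayList<>();
                Deque<Integer> queue = new ArrayDeque<>();
                queue.add(start);
                assigned[start] = true;
                while (!queue.isEmpty()) {
                    int i = queue.poll();
                    set.add(i);
                    for (int j = 0; j < n; j++) {
                        // Any non-zero influence, in either direction, counts as a connection.
                        boolean connected = influence[i][j] > 0.0 || influence[j][i] > 0.0;
                        if (connected && !assigned[j]) {
                            assigned[j] = true;
                            queue.add(j);
                        }
                    }
                }
                processSets.add(set);
            }
            return processSets;
        }
    }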


We define the following notations. The Process Set has n string entities SE. Each string entity SE_k has an associated set PCE_SEk of canonic entities CE_j, where j ∈ {1, ..., m_k} and m_k is the number of canonic entities in PCE_SEk. From the previous section we have the directed graph G = <CEset, Lset>. We also define the tuple array LA = {(CE_a, CE_b, score_ab) | CE_a ∈ PCE_SEa, CE_b ∈ PCE_SEb, a ≠ b}, where CE_a and CE_b each belong to a different probable canonic entity set and form the end-points of a path with real-valued score score_ab. We consider the link between any CE_a and CE_b in LA as undirected, as it represents a score given to a path between these two entities in G.

For each canonic entity CE_j in the set PCE_SEk of each string entity SE_k we perform a BF search with maximum depth 3. If during the BF search we encounter a canonic entity CE_q that belongs to the set of another string entity, then we have found a path between these entities, to which we assign a score s_qj. If a path between CE_q and CE_j already exists in LA we increase its current score by s_qj, otherwise we add the new tuple to LA with initial score s_qj. The function score(q, j) assigns a score to the path between any two entities CE_q and CE_j. Its value is given by a heuristically constructed function:

score(q, j) = \frac{1}{dist_G(CE_q, CE_j)} \cdot influence(SE_a, SE_b)    (4)

meaning we take the inverse of the distance in graph G between CE_q and CE_j and multiply it by the influence between the string entities SE_a and SE_b to which CE_q and CE_j respectively belong. The influence between SE_a and SE_b is taken from the influence matrix of string entities, which is not symmetrical, so score(q, j) ≠ score(j, q). For this reason, if a path is found in G between CE_q and CE_j, a path could also be found between CE_j and CE_q, but with a different score. Because in the next part of the algorithm we are only interested in whether two entities are connected and with what score, we use LA as an array of tuples indicating scores between entities. LA can also be viewed as an undirected weighted graph G' because it holds tuples that can represent weighted edges between two vertices. The next step of the algorithm is performed on graph G', where a recursive Depth-First (DF) search is started from every vertex. The DFS function is presented next:

function DFS(CEv) {
  push CEv to path;
  for every neighbor CEn of CEv {
    if CEn visited OR CEn does not belong to a unique PCESE
      continue to next neighbor;
    else
      DFS(CEn);
  }
  if CEv has no valid neighbors
    addSolution(path);
  pop CEv from path;
}

DFS() is a recursive DF search, storing the path from the initial vertex on each recursive call, and stopping to add a new solution only when no neighbor vertices can


be further added to the path. A new vertex can be added only if it is not already present in the path and the vertex (which is a canonic entity CE) belongs to a probable canonic entity set PCESE that no other vertex on the path belongs to. This ensures that we only add solutions that contain one canonic entity for every string entity found. To explain the addSolution() function we must now define a result item and a result item list. A result item list RIL is an array that holds result items. A result item RI is an array of length n (where n is the total number of string entities analyzed in this Process Set) that holds in each position either a candidate canonic entity CE or a void value. The void value can be interpreted as “any canonic entity”. For example, let us consider four string entities SEA, SEB, SEC, SED, each with its associated PCESE. Let us assume that for SEA, PCESEA is {A1, A2}, for SEB, PCESEB is {B1, B2}, and similarly for SEC and SED. A result item RI will in this case have length 4, because we have 4 string entities. In each position, RI must have either a void value or a canonic entity belonging to the respective PCESE. For example, a void RI can be visualized as {∗, ∗, ∗, ∗} (where “∗” is interpreted as “any” canonic entity), a completely filled RI as {A1, B2, C1, D1} and a partially filled RI as {A1, ∗, ∗, D2}. The purpose of addSolution() is to add new, increasingly larger result items to the solutions list. The function initially creates an empty result item, populates it with the entities found in DFS’s path, and then checks every other result item in our result item list to see whether the current result item is a subset of, or equal to, an existing one. Thus, we only allow larger sets to be added. For example, if RI1 = {A1, ∗, C2, D1} and RI2 = {A1, ∗, ∗, D1}, then RI2 is a subset of RI1. After DFS() has been run on every CE in G', we have a result item list that contains solution sets of different sizes. The next step is to assign a score to each solution, as until now every result item has a zero score. The computeScores() function takes every link in G' and searches for that link in every solution in the result item list. If a result item RI contains a link, then RI’s score is increased by the weight of that link. The final step of the algorithm is the merging of distinct result items. The function mergeResultSetList() initially creates an empty RIL' to hold merged results. Then it iteratively takes every result item RI in RIL and looks at the following result items RIj. If an RIj is a distinct set from RI, the two are merged and the result is added to RIL'. DFS() creates the largest possible solution among sets that are connected; distinct result items, however, are not touched. Using mergeResultSetList() is the final step needed to create the highest scoring, most complete result items possible based on the available information.
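A minimal sketch of the result item bookkeeping described above, assuming a result item is represented as an array of canonic entity names with null playing the role of the “any” value; the method names and the reading of “distinct” as “filling disjoint positions” are ours.

    import java.util.Arrays;

    public class ResultItemSketch {
        // true if every filled position of 'candidate' agrees with 'other',
        // i.e. 'candidate' is a subset of (or equal to) 'other' and adds nothing new.
        static boolean isSubsetOf(String[] candidate, String[] other) {
            for (int i = 0; i < candidate.length; i++) {
                if (candidate[i] != null && !candidate[i].equals(other[i])) return false;
            }
            return true;
        }

        // true if the two result items fill disjoint positions (our reading of "distinct").
        static boolean isDistinct(String[] a, String[] b) {
            for (int i = 0; i < a.length; i++) {
                if (a[i] != null && b[i] != null) return false;
            }
            return true;
        }

        // Merge two distinct result items into a larger, more complete one.
        static String[] merge(String[] a, String[] b) {
            String[] merged = new String[a.length];
            for (int i = 0; i < a.length; i++) merged[i] = (a[i] != null) ? a[i] : b[i];
            return merged;
        }

        public static void main(String[] args) {
            String[] ri1 = {"A1", null, "C2", "D1"};
            String[] ri2 = {"A1", null, null, "D1"};
            String[] ri3 = {null, "B2", null, null};
            System.out.println(isSubsetOf(ri2, ri1));             // true: ri2 adds nothing over ri1
            System.out.println(isDistinct(ri1, ri3));             // true: they fill different positions
            System.out.println(Arrays.toString(merge(ri1, ri3))); // [A1, B2, C2, D1]
        }
    }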

4 Evaluation

We have run the system on a standard 2.0 GHz PC with 4 GB of RAM. RAM is only needed to hold in main memory the graph on which the algorithm runs. On normal documents of about 200 sentences with around 600 entities we used around 1 GB of RAM. That may seem a lot, but a graph this size contains around 90,000 canonic entities and the links between them. The algorithm’s own use of RAM is negligible. Due to the splitting of the string entities into subsets that are processed individually, the


algorithm computationally performs very well, because usually a single set contains no more than 4-10 entities, a number for which processing is almost instant, even though there are usually anywhere from a few tens to a few thousand probable entities for each string entity in the set. Even better, due to the independent nature of the process sets, they can be run in parallel without any modification to the algorithm. The entire system has been built in Java, using the YAGO ontology as a PostgreSQL database stored on the hard drive. Actually, the slowest part of the system is the graph creation, because of the repeated PostgreSQL queries needed to build the graph; this accounts for more than 90% of the entire system run-time. Evaluation of the system’s results is a somewhat difficult task. We have found no other systems to compare ours with, because of our particular setting. We cannot apply the system to reference test corpora like ACE 2003/2004 or similar Sens/SemEval corpora, because we rely on a large generic ontology and not on a restricted subset of entities, and we handle both named and common entities. Also, we cannot restrict our working entity set, because the system works better the larger the entity set and the number of connections within it are. Our knowledge base is actually our entire result space: the larger the number of entities and relations, the larger the number of resulting assigned classes. However, we can manually create a set of tests and results and measure our system’s performance against these tests. We measure the accuracy of assigned canonic entities. For example, for the sentence “Smith was born in Farmersville, a small town in California.” we extract the string entities (“Smith”, “Farmersville”, “town”, “California”). For the result item (ANY, Farmersville,_California, wordnet_town, California) we assign a 4/4 (100% accuracy) score, because it correctly identified that Smith could be any person (meaning that either YAGO does not contain any possible canonic entities for “Smith” or, more likely, that no links have been found between any canonic entity representing Smith and any other entity), that “Farmersville” is correctly identified by the canonic entity Farmersville,_California, “town” by wordnet_town and “California” by California. If, for example, instead of ANY the result had been the canonic entity John_Smith, then accuracy would have dropped to 3/4 (75%), because even if there is some long, improbable, low-scoring path between John Smith and Farmersville, such as John_Smith bornIn (→) San_Francisco type (→) city type (←) Farmersville,_California, for a human there is no logical link, because we know (or agree by general consensus) that no generic John Smith was actually born in a small town in California named Farmersville. We thus evaluate the system against human judgment of which canonic entities should correctly represent each string entity. Accuracy is calculated as the number of correctly matched string entities divided by the total number of string entities. For evaluation we gathered a set of 50 sentences of varied length, each with at least 3 string entities, for a total of 236 extracted string entities, each evaluated and assigned an ontology class by hand. The system was run, and we evaluated the first result item for every processed sentence. We obtained an arguably average performance of around 26% for this strict evaluation method.
Performance is affected because in many cases we run into one or both of the following issues. Issue 1: the system cannot yet discriminate between similarly scoring result items with similar entity types. Given the sentence “Alan Mulally has just announced the


new Focus with a 1.6 liter engine.” with string entity set (“Alan Mulally”, ”Focus”, “1.6 liter”, “engine”), the two top-scoring result items are:

RI0: 2.0 (ANY, Ford_Focus_WRC, wordnet_liter, wordnet_automobile_engine)
RI1: 2.0 (ANY, Ford_Focus, wordnet_liter, wordnet_automobile_engine)

As can be seen, the only difference between the two is that “Focus” could be either a Ford Focus WRC or a generic Ford Focus. Both entities are present in YAGO with the same types of links, and no information can differentiate one from the other. Because YAGO does not know that the WRC Focus is actually a modified version of the standard Focus, it treats both entities as equally possible representatives for the string entity “Focus”. Issue 2: the system cannot yet discriminate between general categories of entities. For example, for the sentence “Bucharest is the capital of Romania”, with string entities (“Romania”, “Bucharest”, “capital”), we have the following two top-scoring result items:

RI0: 2.65 (Romania, Bucharest, wordnet_capital)
RI1: 2.65 (Prince_Mircea_of_Romania, Bucharest, wordnet_capital)

As seen in the example, both result items have the same 2.65 score. For the system, both Romania and Prince_Mircea_of_Romania score similarly because both are entities that can represent “Romania”, and the string entity “capital” has no influence on the string entity “Romania”. For humans the correct answer is obvious, but for the computer it is slightly more difficult: for Romania, YAGO knows the fact that Bucharest is its capital, while for Prince_Mircea_of_Romania it knows that Bucharest is his birthplace. Because both facts are direct links between representative entities, they are given equal top scores, and therefore the final scores will be similar. A heuristic could be introduced so that, in such score-tie scenarios, the system gives preference to exactly matching words, for example, but that would only cover some cases and create false positives in others. While for issue 1 we have no immediate solution, because we rely entirely on the availability of data in the ontology, for issue 2 we can assign a generic type to every string entity. We could build a relation detection system that could indicate, to some degree, what type of relation exists between two entities. Given that in YAGO all relation types have their domain and range defined, we also assigned these types to each named entity. So we know that the canonic entity Romania is a location and Prince_Mircea_of_Romania is a person. If we also built such a relation detection module we would get hints about the type of entity expected in a relation, and could thus correctly discriminate between equal result items, as we would then know that an isCapitalOf relation stands between a location and another location. We propose two more ‘forgiving’ evaluation methods, in which we relax the allowed results by ignoring first issue 1 and then both issues, and we check whether any of the equal top-score result items contains correctly identified canonic entities. To exemplify this relaxation, if we take the Ford Focus example above, we would get for that sentence a 4/4 (100%) accuracy, because even though the system’s default choice is RI0, which only evaluates to 3/4 (75%) accuracy, we also inspect RI1 because it has


the same top score, and we detect that RI1 actually provides a better 4/4 (100%) accuracy.

Table 2. Accuracy of the system against a manually created standard

Evaluation method             String Entity Accuracy
Strict evaluation             62 / 236 (26.2%)
Evaluation w/o issue 1        97 / 236 (41.1%)
Evaluation w/o issues 1&2     101 / 236 (42.7%)

While our initial results with this system are average, we can draw some conclusions. First, the results depend heavily on the type and composition of the sentences tested. For sentences with entities in areas of the ontology with higher information density, results are usually better, because of the increased number of links and not necessarily because of the scoring function. This function is an important factor affecting performance: we have used a distance-based function, which is sensitive to information density fluctuations, a problem practically unavoidable in large general ontologies. Second, the calculated accuracy depends even more on the human-created standard against which results are evaluated; but we can only evaluate the system against such a standard. The standard was created by three people manually reviewing possible classes extracted from YAGO and assigning them as correct answers to each string entity. Even so, disagreements have been rather common between the annotators because of the large number of similar correct classes. We estimate an inter-annotator agreement (ITA) of at most 65%. Similar results were obtained for fine-grained tasks; for example, [7] reports an ITA on WordNet senses between 67% and 80%. Also, a standard baseline was very difficult to establish. Standard baselines like random-sense or first-sense are hard to implement because we perform both named and common entity identification, meaning we do not have a ‘first sense’ as we would have if we evaluated only common nouns, for example. Also, because of the number of seemingly good responses (especially for named entities) among a very large number of possible classes, a random baseline would yield uninformative results. For example, the entity Hyundai could mean the auto company, the ship-building company, or any of its 30+ car models, all being named entities. Though not directly comparable, for a general overview, SemEval 20073 yielded results in the 50%-60% performance range for fine-grained tasks (with a maximum of 10% above the baseline for the best system, on 465 tagged words), underlining the current task’s difficulty. Third, context is highly important. For example, for the sentence "Hyundai has launched a new car named Santa Fe.", with string entities “Santa Fe”, “car”, “Hyundai”, we obtain the result item (Hyundai_Santa_Fe, wordnet_car_102958343, Hyundai_Tucson), scoring 2/3 accuracy, because the system thinks that “Hyundai” could mean Hyundai_Tucson, which is a car, instead of Hyundai_Motor_Company. However, for the sentence "Hyundai has launched the new Santa Fe." we obtain (Santa_Fe_Industries, Hyundai_Motor_Company), because of the conceptual

3 SemEval 2007 - http://nlp.cs.swarthmore.edu/semeval/index.php


link between Santa Fe Industries and Hyundai Motor Company, as they are both industries, while missing the link to the automobiles because of insufficient evidence for Santa Fe being a car. Fourth, the proposed algorithm efficiently makes the most of the information available to it. Where links are available, it finds all possible connections, evaluates them all in a single pass instead of processing an exponential number of entity combinations, and, based on the scoring method, creates the best result sets given the available information.

5 Conclusions

In this paper we presented an unsupervised, knowledge-based system with a viable algorithm and encouraging first results for entity identification and correct class assignment from an ontology. We aim to show that ontologies can be used for more than just the standard classification of the entities they contain, and that the very structure of such large generic ontologies can be used to generate added value. Furthermore, we have presented an algorithm that provides fast results in a single pass for the problem of evaluating the best combination of every possible entity assignment. Using a standard combinatorial approach, where each entity would be tested against every other, the problem would grow exponentially and become intractable. In conclusion, we note the major issues that influence performance to a large degree:
1. Dependency tree generation. In most cases the tree is correctly generated, but it also happens that the tree misses or incorrectly assigns dependencies between words, which leads to a poor starting point for the influence matrix creation.
2. Matrix creation rules. The matrix is generated by parsing the dependency tree. As the rules are heuristically created, new rules or improved versions can be implemented.
3. Scoring function. Like the matrix creation rules, the scoring function has been heuristically chosen. As with existing similarity measures for WordNet, for example (reference), variations of the scoring function applied in the same algorithm can be created for improved system performance. While we used a distance-based scoring method, which by default suffers from unequal information density [1], it does provide good performance and is applicable to both named and common entities, even though named entities are linked in an arbitrary graph of direct links while common entities are linked in a hypernym tree. For this reason a conceptual-density [6] measure is somewhat risky to implement.
4. Knowledge source. The most important factor in the system’s performance, by a large margin, is the ontology used. For sentences where there is a large amount of information about a subject, results will be surprisingly good, while lower information densities will yield poor results. As knowledge sources get richer over time, the system’s performance will increase even without any change to the algorithm.
Improving on any of these issues would cause an increase in performance. Knowledge-based systems usually have somewhat poorer performance than fully


supervised machine learning algorithms. However, they benefit from a wider coverage due to the general, large knowledge sources they exploit [1].

Acknowledgments. The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of Labour, Family and Social Protection through the Financial Agreement POSDRU/6/1.5/S/19.

References
1. Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41, 2, Article 10, 69 pages (2009)
2. Manning, C., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA (1999)
3. Mihalcea, R.: Knowledge-based methods for WSD. In: Agirre, E., Edmonds, P. (eds.) Word Sense Disambiguation: Algorithms and Applications, pp. 107–131. Springer, New York (2006)
4. Agirre, E., Rigau, G.: Word sense disambiguation using conceptual density. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING, Copenhagen, Denmark), pp. 16–22 (1996)
5. Navigli, R.: Consistent validation of manual and automatic sense annotations with the aid of semantic graphs. Computational Linguistics 32, 2, 273–281 (2006)
6. Navigli, R.: Experiments on the validation of sense annotations assisted by lexical chains. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL, Trento, Italy), pp. 129–136 (2006)
7. Palmer, M., Dang, H., Fellbaum, C.: Making fine-grained and coarse-grained sense distinctions, both manually and automatically. J. Nat. Lang. Eng. 13, 2, 137–163 (2007)
8. Tratz, S., Sanfilippo, A., Gregory, M., Chappell, A., Posse, C., Whitney, P.: PNNL: A supervised maximum entropy approach to word sense disambiguation. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval, Prague, Czech Republic), pp. 264–267 (2007)
9. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th SIGDOC (New York, NY), pp. 24–26 (1986)
10. Hindle, D., Rooth, M.: Structural ambiguity and lexical relations. Computational Linguistics 19, 1, 103–120 (1993)
11. Agirre, E., Martinez, D.: Learning class-to-class selectional preferences. In: Proceedings of the 5th Conference on Computational Natural Language Learning (CoNLL, Toulouse, France), pp. 15–22 (2001)
12. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybernet. 19, 1, 17–30 (1989)
13. Suchanek, F.-M., Kasneci, G., Weikum, G.: YAGO - A Core of Semantic Knowledge. In: Proceedings of the 16th International World Wide Web Conference (2007)
14. Navigli, R., Velardi, P.: Structural semantic interconnections: A knowledge-based approach to word sense disambiguation. IEEE Trans. Patt. Anal. Mach. Intell. 27, 7 (2005)
15. Mihalcea, R., Tarau, P., Figa, E.: PageRank on semantic networks, with application to word sense disambiguation. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING, Geneva, Switzerland), pp. 1126–1132 (2004)


Information Retrieval System for Similarity Semantic Search

Laura DRAGOI1, Florin POP1

1 Faculty of Automatic Control and Computers, University POLITEHNICA of Bucharest, Splaiul Independentei 313, Bucharest 060042, Romania
[email protected], [email protected]

Abstract. An important step in conducting a research project is studying the state of the art in the field of interest and finding research articles on a similar topic. There are numerous academic digital libraries and bibliographic databases online: ACM Digital Library, IEEE Xplore, CiteSeerx, etc. Navigating through these large collections is difficult and time consuming. In this paper we propose a system that automatically finds articles similar to a research topic, given its title and its abstract. The system returns a list of articles or projects, ranked by their similarity to the research topic. The results come from an index containing both research and diploma/master thesis projects proposed by the DSLab team (Distributed Systems Laboratory, University POLITEHNICA of Bucharest) and the DBLP Computer Science Bibliography. We experiment with different settings for the system and evaluate the results, and we also propose a solution for the cross-language retrieval problem (most of the projects proposed by our University are described in Romanian, while DBLP is in English).

Keywords: Retrieval System, Similarity Semantic Search, Distributed Systems.

1 Introduction

A very large number of academic publications, ranging from journal and newspaper articles to conference proceedings, reports, patents and books, have been written, especially in the Computer Science field. The DBLP Computer Science Bibliography4, for example, lists over 1.5 million such publications at the beginning of January 2011. At the same time, the digital library and search engine CiteSeerx has indexed 750,000 full-text documents5, primarily focused on computer and information science, and the Academic Search engine developed by Microsoft6 has 1,898,866 results in its Top publications in Computer Science list. According to Manning, Raghavan and Schütze [1], information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on

4 http://www.informatik.uni-trier.de/~ley/db/
5 http://citeseerx.ist.psu.edu/about/site
6 http://academic.research.microsoft.com/


computers). IR can also cover other kinds of data and information problems beyond those specified in the core definition above. The term “unstructured data” refers to data which does not have a clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database. The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. The interaction of a user with a retrieval system, according to Baeza-Yates and Ribeiro-Neto (2002) [2], is shown in Fig. 1.

Fig.1. Interaction of the user with the retrieval system through distinct tasks [2].

The proposed system should be easily extendable with more proposed projects from the DSLab. The system should also allow coupling more sources of academic publications in the future. The algorithms used for determining similar publications should be interchangeable, and a mechanism for comparing new results with previous ones should be developed in order to evaluate their performance. In a retrieval task the user has an information need and has to translate it into a query that the IR system can understand, usually a set of words that convey the semantics of the needed data. If the user has an interest which is poorly defined or inherently broad, an interactive interface that allows the user to look around in the collection is better suited; this is a browsing task. The documents in the collection that is the object of the retrieval system are frequently represented through a set of index terms or keywords. The index terms are obtained from the full text of the documents by reducing it to a set of representative keywords. According to [2], the document preprocessing procedure can be divided into five main text operations (a minimal sketch of these operations is given at the end of this section):
• Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.
• Elimination of stop words, with the objective of filtering out words with very low discrimination value for retrieval purposes.
• Stemming of the remaining words, with the objective of removing prefixes and suffixes and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).
• Selection of index terms to determine which words/stems will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact,


noun words frequently carry more semantics than adjectives, adverbs and verbs.
• Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, to allow the expansion of the original query with related terms.
This paper addresses the problem of similarity semantic search based on the enumerated text operations. The rest of the paper is structured as follows: Section 2 presents the classical IR models and a short overview of existing solutions in the field; Section 3 describes the performance analysis for the proposed system; Section 4 describes the implementation details; the conclusions and future work are presented in Section 5.
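The following is a minimal, self-contained sketch of the first three operations listed above (lexical analysis, stop word elimination, and a crude suffix-stripping stand-in for a real stemmer such as Porter's); the stop word list and suffix rules are illustrative only, not the system's actual configuration.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class PreprocessSketch {
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "is", "to", "for"));

        public static List<String> indexTerms(String text) {
            List<String> terms = new ArrayList<>();
            // Lexical analysis: lower-case and split on anything that is not a letter or digit.
            for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // stop word elimination
                terms.add(stem(token));
            }
            return terms;
        }

        // Very rough suffix stripping; a real system would use a proper stemmer.
        private static String stem(String word) {
            if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
            if (word.endsWith("ed") && word.length() > 4) return word.substring(0, word.length() - 2);
            if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
            return word;
        }

        public static void main(String[] args) {
            System.out.println(indexTerms("Connecting to the indexed collections of documents"));
            // prints [connect, index, collection, document]
        }
    }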

2 Information Retrieval Systems

There are three classical models for information retrieval systems: the Boolean model, the vector model and the probabilistic model. The Boolean model is based on set theory and Boolean algebra. Queries are specified as Boolean expressions which have precise semantics. The retrieval model is based on a binary decision: a document is either relevant or non-relevant for a query. An index term is either present or absent from a document; there is no partial match. Simplicity is an advantage, but the disadvantage is that, due to the exact matching, too few or too many results might be returned for a query. The vector model proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to the index terms, which are used to compute the degree of similarity between each document stored in the system and the user query. The result set is sorted in decreasing order of the degree of similarity and is more precise than the answer set returned by the Boolean model. The probabilistic model attempts to capture the IR problem within a probabilistic framework. The basic idea is that, given a user query, there is a set of documents which contains exactly the relevant documents and no others, the “ideal” answer set. The system starts by guessing an initial set of relevant answers and then takes into account the relevance feedback given by the user in order to improve the result set. The advantage is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages are the need to guess the initial separation of documents into relevant and non-relevant, the fact that the method does not take into account the frequency with which an index term occurs in a document, and the independence assumption for index terms. In [3], Aragon et al. propose a shape-based image retrieval model. They focus on the shape feature of the objects inside images because there is evidence that natural objects are primarily recognized by their shapes. Using this feature of objects, the semantic gap is reduced considerably. The technique relies on an alternative representation of shapes, called the two-segment turning function (2STF), a new model suited to images. The newer trends are oriented towards mobile computing. In [4] a new semantic-aware multimedia representation and accessing model in distributed and mobile database environments is presented and analyzed. Semantic classification and categorization of the


multimedia databases are based on the Summary Schemas Model. The ability to summarize general information provides a promising mechanism to represent and access multimedia data entities. In this work Yang proposes a logic-based model that can be integrated and used as the paradigm for text and multimedia data content representation. Augusto et al. present in [5] an information retrieval model to find information items with semantic content similar to a given user’s query. The internal representation of information items is based on user interest groups, called "semantic cases". The model also defines a similarity measure for ordering the results based on the semantic distance between semantic case items. This is important and complex because exploring the metadata associated with documents in the Semantic Web is a way to increase the precision of information retrieval systems. The systems established so far have failed to fully overcome the limitations of search based on keywords; they are built from variations of the classic models that represent information by keywords and work upon statistical correlations.
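To make the vector model described at the beginning of this section concrete, the sketch below computes the cosine similarity between a query and a document over a shared term vocabulary; plain term counts are used as weights here, whereas a real system would typically use tf-idf.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class VectorModelSketch {
        static Map<String, Integer> termCounts(String text) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
            }
            return counts;
        }

        // Cosine of the angle between the two term-frequency vectors.
        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            Set<String> vocabulary = new HashSet<>(a.keySet());
            vocabulary.addAll(b.keySet());
            double dot = 0, normA = 0, normB = 0;
            for (String term : vocabulary) {
                int wa = a.getOrDefault(term, 0);
                int wb = b.getOrDefault(term, 0);
                dot += wa * wb;
                normA += wa * wa;
                normB += wb * wb;
            }
            return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
        }

        public static void main(String[] args) {
            Map<String, Integer> query = termCounts("distributed information retrieval");
            Map<String, Integer> doc = termCounts("an information retrieval system for distributed environments");
            System.out.printf("similarity = %.3f%n", cosine(query, doc));
        }
    }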

3 Performance Analysis for Information Retrieval Systems

According to Manning [6], the standard approach to information retrieval system evaluation revolves around the notion of relevant and non-relevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or non-relevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum. Relevance is assessed relative to an information need, not a query. We define several metrics for evaluating the performance of an information retrieval system. Precision is the fraction of the documents retrieved that are relevant to the user's information need:

precision = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}    (2)

Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system; this measure is called precision at n or P@n. Recall is the fraction of the documents that are relevant to the query that are successfully retrieved:

recall = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}    (3)


It can be looked at as the probability that a relevant document is retrieved by the query. It is trivial to achieve a recall of 100% by returning all documents in response to any query; therefore recall alone is not enough, and one also needs to measure the number of non-relevant documents, for example by computing the precision. Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

fall\text{-}out = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}    (4)

It can be looked at as the probability that a non-relevant document is retrieved by the query. It is trivial to achieve a fall-out of 0% by returning zero documents in response to any query. F-measure is the weighted harmonic mean of precision and recall:

F = \frac{2 \cdot precision \cdot recall}{precision + recall}    (5)

This is also known as the F1 measure, because recall and precision are evenly weighted. The general formula for non-negative real \beta is:

F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}    (6)

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall. The F-measure was derived by van Rijsbergen (1979) so that F_\beta measures the effectiveness of retrieval with respect to a user who attaches \beta times as much importance to recall as to precision. It is based on the effectiveness measure:

E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}    (7)

Their relationship is F_\beta = 1 - E, where \alpha = (\beta^2 + 1)^{-1}. Precision and recall are single-value metrics based on the whole list of documents returned by the system. For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. Average precision emphasizes ranking relevant documents higher. It is the average of the precisions computed at the position of each of the relevant documents in the ranked sequence:

AveP = \frac{\sum_{r=1}^{N} P(r) \cdot rel(r)}{|\{\text{relevant documents}\}|}    (8)

where r is the rank, N the number retrieved, rel(r) a binary function on the relevance of a given rank, and P(r) precision at a given cut-off rank:


P(r) = \frac{|\{\text{relevant retrieved documents of rank } r \text{ or less}\}|}{r}    (9)

This metric is also sometimes referred to geometrically as the area under the Precision-Recall curve. Note that the denominator (number of relevant documents) is the number of relevant documents in the entire collection, so that the metric reflects performance over all relevant documents, regardless of a retrieval cutoff. Mean average precision for a set of Q queries is the mean of the average precision scores for each query:

MAP = \frac{\sum_{q=1}^{Q} AveP(q)}{Q}    (10)

Discounted Cumulative Gain (DCG) uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically, proportionally to the position of the result. The DCG accumulated at a particular rank position p is defined as:

DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log(i)}    (11)

Since the result set may vary in size among different queries or systems, to compare performances the normalized version of DCG uses an ideal DCG. To this end, the documents of a result list are sorted by relevance, producing an ideal DCG at position p (IDCG_p), which normalizes the score:

nDCG_p = \frac{DCG_p}{IDCG_p}    (12)

The nDCG values for all queries can be averaged to obtain a measure of the average performance of a ranking algorithm. Note that for a perfect ranking algorithm, DCG_p will be the same as IDCG_p, producing an nDCG of 1.0. All nDCG values are thus relative values on the interval 0.0 to 1.0 and are therefore cross-query comparable.
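A minimal sketch of the ranked-list metrics above (average precision, eq. 8-9, and nDCG, eq. 11-12) for a single query; the relevance judgments are passed in explicitly, and since eq. 11 does not specify a logarithm base, the natural logarithm is used here.

    import java.util.Arrays;

    public class RankingMetricsSketch {
        // Average precision (eq. 8): rel[i] is 1 if the document at rank i+1 is relevant.
        static double averagePrecision(int[] rel, int totalRelevantInCollection) {
            double sum = 0;
            int relevantSoFar = 0;
            for (int i = 0; i < rel.length; i++) {
                if (rel[i] == 1) {
                    relevantSoFar++;
                    sum += (double) relevantSoFar / (i + 1); // P(r) at this rank, eq. 9
                }
            }
            return totalRelevantInCollection == 0 ? 0 : sum / totalRelevantInCollection;
        }

        // DCG at position p (eq. 11), with graded relevance values rel[0..p-1].
        static double dcg(double[] rel, int p) {
            double score = rel.length > 0 ? rel[0] : 0;
            for (int i = 2; i <= Math.min(p, rel.length); i++) {
                score += rel[i - 1] / Math.log(i); // natural log; base 2 is also common
            }
            return score;
        }

        // nDCG at position p (eq. 12): DCG normalized by the DCG of the ideal ordering.
        static double ndcg(double[] rel, int p) {
            double[] ideal = rel.clone();
            Arrays.sort(ideal);
            for (int i = 0; i < ideal.length / 2; i++) { // reverse to descending order
                double tmp = ideal[i];
                ideal[i] = ideal[ideal.length - 1 - i];
                ideal[ideal.length - 1 - i] = tmp;
            }
            double idcg = dcg(ideal, p);
            return idcg == 0 ? 0 : dcg(rel, p) / idcg;
        }

        public static void main(String[] args) {
            int[] binaryRel = {1, 0, 1, 1, 0};     // ranked list, 1 = relevant
            double[] gradedRel = {3, 2, 3, 0, 1};  // graded relevance per rank
            System.out.println(averagePrecision(binaryRel, 4)); // 4 relevant docs in the collection
            System.out.println(ndcg(gradedRel, 5));
        }
    }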

4 Implementation details for Information Retrieval Systems

In the implementation of the information retrieval component of our system we relied on the open source search platform Apache Solr. We continue by presenting Solr and how we integrated it into our system. Apache Solr7 is the open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g.,

7 http://lucene.apache.org/solr/


Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application, and it has an extensive plugin architecture for when more advanced customization is required. The way Solr is usually integrated in an application is shown in Fig. 2.

Fig.2. Common Solr usage [7].

Solr is based on Apache Lucene, an open source, high-performance text search engine library first developed in 2000. According to [7], the major features of Lucene are:
• A text-based inverted index with persistent storage for efficient retrieval of documents by indexed terms;
• A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched;
• A query syntax with a parser and a variety of query types, from a simple term lookup to exotic fuzzy matches;
• A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring;
• A highlighter feature to show words found in context;
• A query spellchecker based on indexed content.
The text field type contains two analyzers, one used when indexing information in fields of this type and the other used when processing a query run on a field of this type. The two analyzers are very similar; the difference is that the query analyzer also employs a SynonymFilterFactory to expand the query with known synonyms. Text processing starts by applying the WhitespaceTokenizerFactory, which creates tokens from sequences of characters separated by whitespace. The tokens are further processed with a WordDelimiterFilterFactory, which splits words into sub-words and performs optional transformations on sub-word groups. An example analysis executed on a title of type text using Solr’s analysis debug interface is shown in Fig. 3.


Fig.3. Analysis interface

Our system uses the dismax query parser instead of the default Lucene standard parser. The dismax handler has the following features over the standard handler [7]:
• Searches across multiple fields with different boosts through Lucene's DisjunctionMaxQuery.
• Limits the query syntax to a small subset, so there is never a syntax error. This feature is not optional or configurable.
• Automatic phrase boosting of the entire search query.
• Convenient query boosting parameters, generally for use with function queries.
• Can specify the minimum number of words to match, depending on the number of words in the query string.
In order to use the dismax query parser, our search handler's defType argument was set to dismax. To configure which fields are searched and their boosts we set the qf argument of the handler to: text textro title_copy^1.5 title_copy_ro^1.5. This configuration enables cross-language retrieval: the user can enter the query either in Romanian or in English and the corresponding projects will match. If the query is in Romanian, there will be a match on the fields textro and title_copy_ro. If the query is in English, it will match title_copy and text. Because the query is of dismax type, if a term matches several fields, only the maximum score is retained. This means that if a query has terms that would match document terms in both the Romanian and the English versions, and the terms happen to be similar, only the maximum value will influence the scoring of the document. In a standard type of query, the score is computed as a sum of the scores on each of the fields which, in our case of similar terms matching both English and Romanian, would lead to an unwanted rise in the score of a document based on the similar terms rather than on its relevance for the query. Thus, using the dismax handler and a query field (qf) list


containing both English and Romanian content enables us to efficiently solve the cross-language retrieval problem. We have also applied a boosting factor of 1.5 on the title_copy and title_copy_ro fields, so that if a query matches the title of a document, that document is considered more relevant to the query than a document that matches the query only in its content. The query syntax is restricted in the dismax handler: + and - can be used to specify whether a query term is mandatory or prohibited (but not AND, OR, ||, &&). Anything else is escaped if needed to ensure that the underlying query is valid, and the user will never get a syntax error. Phrase queries are supported, enclosed in double quotation marks. The default search in the dismax handler is defined by the q.alt argument: this is the query that is performed if q is not specified. Unlike q, it uses Solr's regular (full) syntax, not dismax's limited one. This allows using the query *:*, which matches all the documents in the index. Other parameters configured in our search handler are the number of results to be displayed on a page (rows, which defaults to 10), the field list fl containing all the fields plus the score, and the output type parameter wt, which is set to velocity. The output of the search handler employs Apache Velocity8 templates for presenting the results to the user. The parameters used by the template system are also set on the search handler: the template name browse, the properties file velocity.properties and the content type text/html;charset=UTF-8. The user interface of the retrieval system is shown in Fig. 4.

Fig.4. Search User Interface
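A hedged sketch of how a client could issue the dismax search described above through the SolrJ API (the paper does not show client code): the Solr URL, the "title" result field and the Solr 1.4-era client class are assumptions, while the qf value matches the handler configuration given earlier.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DismaxSearchSketch {
        public static void main(String[] args) throws Exception {
            // Solr 1.4-era SolrJ client; the URL is a placeholder.
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("sisteme distribuite"); // Romanian query text
            query.set("defType", "dismax");
            // Cross-language field list with title boosting, as configured in the search handler.
            query.set("qf", "text textro title_copy^1.5 title_copy_ro^1.5");
            query.set("fl", "*,score"); // return all stored fields plus the score
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                // "title" is a hypothetical stored field name.
                System.out.println(doc.getFieldValue("title") + "  score=" + doc.getFieldValue("score"));
            }
        }
    }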

8 http://velocity.apache.org/


Faceting is enabled on the year, studies_level, supervisors, num_students and db_source fields through the corresponding facet configuration in the search handler.

Faceting enhances search results with aggregated information over all of the documents found in the search. Faceted navigation is available by clicking the facet links, which apply Solr filter queries to a subsequent search. A Solr filter query is added by appending an fq parameter to the URL of the corresponding search page. Filters have the following advantages:
• They improve performance, because each filter query is cached.
• They do not affect the scores of matched documents.
• They are easier to apply than modifying the user's query, which is error prone.
• They clarify the logs, which show what the user queried for without it being confused with the filters.
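A short sketch of the faceted navigation described above, again through SolrJ: the facet fields are the ones named in the text, while the filter value and the Solr URL are only examples.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetedSearchSketch {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("information retrieval");
            query.setFacet(true);
            query.addFacetField("year", "studies_level", "supervisors", "num_students", "db_source");
            query.setFacetMinCount(1);
            // A filter query (fq) narrows the results without affecting document scores,
            // and is cached independently of the main query.
            query.addFilterQuery("db_source:DSLab"); // example facet value, not from the paper

            QueryResponse response = solr.query(query);
            for (FacetField field : response.getFacetFields()) {
                System.out.println(field.getName() + ": " + field.getValueCount() + " distinct values");
            }
        }
    }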

5 Conclusions and Future work

In this paper we present a system that allows discovering publications similar to a proposed research or diploma project. We have also managed to solve the problem of cross-language retrieval and shown that automatic translation can be considered an option for integrating new projects described in Romanian into our system and successfully finding related material described in English. The technologies that have been used, specifically Apache Solr, guarantee that scaling the system will be possible by distributing and sharding the indexes if they become too large or if the number of users grows considerably. The data in our system and the results it provides are also easy to integrate and use in other applications, which can be developed in any language and can communicate with the Solr server through HTTP and receive the information in standard formats like JSON or XML. A future direction of study would be to integrate the probabilistic Okapi BM25 ranking model and probabilistic language models, and to compare the results of the information retrieval system with those obtained using the current scoring method. Another interesting direction of study is using the profiles of the supervisors proposing the research projects to determine their interests and to recommend papers


similar to those interests, in addition to the results strictly similar to the project descriptions. Also, gathering more information about the researchers, like books they have written, courses they teach, etc., should help provide a better picture of their fields of interest related to the projects they propose.

Acknowledgments. The work has been funded by the Sectorial Operational Program Human Resources Development 2007-2013 of the Romanian Ministry of Labor, Family and Social Protection through the Financial Agreement POSDRU/89/1.5/S/62557.

References
1. Abdalgader, K., Skabar, A.: Short-text similarity measurement using word sense disambiguation and synonym expansion. In: Australasian Conference on Artificial Intelligence, pp. 435-444 (2010)
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press / Addison-Wesley (2002)
3. Chavez-Aragon, A., Starostenko, O.: A shape-based model for visual information retrieval. In: Gelbukh, A., Kuri Morales, A.F. (eds.) Proceedings of the 6th Mexican International Conference on Advances in Artificial Intelligence (MICAI'07), pp. 612-622. Springer-Verlag, Berlin, Heidelberg (2007)
4. Yang, B.: Semantic-Aware Data Processing: Towards Cross-Modal Multimedia Analysis and Content-Based Retrieval in Distributed and Mobile Environments. Ph.D. Dissertation, Pennsylvania State University, University Park, PA, USA. Advisor: Ali R. Hurson. AAI3380642 (2007)
5. de Santana Silva, F.A., Girardi, M.d.R., Drumond, L.R.: An Information Retrieval Model for the Semantic Web. In: Proceedings of the 2009 Sixth International Conference on Information Technology: New Generations (ITNG '09), pp. 143-148. IEEE Computer Society, Washington, DC, USA (2009)
6. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
7. Pugh, E., Smiley, D.: Solr 1.4 Enterprise Search Server. Packt Publishing (2009)