Inf Technol Manag (2014) 15:51–63 DOI 10.1007/s10799-013-0173-x

Context based user ranking in forums for expert finding using WordNet dictionary and social network analysis

Amin Omidvar • Mehdi Garakani • Hamid R. Safarpour



Published online: 1 December 2013. © Springer Science+Business Media New York 2013

Abstract Online forums have become one of the most popular collaborative tools on the Internet, where people are free to express their opinions. Forums provide facilities for knowledge management in which members can share their knowledge with each other. The main problem with knowledge sharing on forums is the extensive amount of data they contain, without any mechanism for determining its validity. For knowledge seekers, knowing the expertise level of each member in a specific context is therefore important for finding valid answers. In this research, a novel algorithm is proposed to determine people's expertise level based on context. The AskMe forum was chosen for the evaluation of the proposed method, and its data were processed in several stages. First, a dedicated crawling program was developed to gather data from the AskMe forum. Then, the raw data were extracted, transformed, and loaded into a designed database using SQL Server Integration Services. Afterwards, people's expertise levels for a specified context were calculated by applying the proposed method to the processed data. Finally, evaluation tests were applied in order to measure the accuracy of the proposed method and compare it with other methods.

A. Omidvar (corresponding author) • M. Garakani
Department of Computer Engineering and IT, Amirkabir University of Technology, 424 Hafez Ave, 15875-4413 Tehran, Iran
e-mail: [email protected] (A. Omidvar); [email protected] (M. Garakani)

H. R. Safarpour
Department of Finance and Economics, Southern Illinois University of Edwardsville, Edwardsville, IL, USA
e-mail: [email protected]

Keywords: Algorithms • Expert finding • Online forums • Link analysis • WordNet dictionary

1 Introduction

Expertise sharing through Internet applications is considered by many scholars to be the next step of knowledge management for organizations. Since the knowledge resources and expertise available within organizations are limited, the demand for seeking knowledge from external sources such as the Internet is increasing. Nowadays, employees often seek knowledge from Internet applications for problem solving, especially in industries where finding the best solution is challenging [1, 2]. Some of these applications, such as forums, play an important role in knowledge sharing among their members. Forums are a special environment where people are free to express their ideas by posting questions and answers. Attributes of online forums such as ease of use, usefulness, social influence, and ease of communication have made them welcome to many Internet users and have turned them into one of the most popular and useful web applications. People help each other in forums for many reasons, such as reputation-enhancement benefits, direct learning benefits, expected reciprocity, and altruism. Many well-known companies, including Microsoft, Oracle, TurboTax, Dell, Amazon, IBM, and Yahoo, run forums for both customers and employees for knowledge sharing and technical support. Based on the literature, an effective knowledge management system should provide not only documented knowledge but also experts who can perform a given social or organizational task and who hold valuable knowledge [3]. A user is recognized as an expert in a particular subject if he or she has a high level of knowledge in that area.


Because of the importance of forums as a tool for knowledge sharing, and the necessity of finding valuable answers among a huge number of posts, many methods have been proposed to address such needs. Moreover, drawbacks such as the unstructured nature of posts and the high volume of shared data remain outstanding challenges.

The first important drawback of forums is the response time gap. There is a significant difference between experts' questions and newbies' questions with regard to the time it takes to receive answers: experts' questions take more time to get their first reply. In a question and answer (QA) forum, the likelihood that a problem posed by an expert is solved is typically lower than for a problem posed by a newbie, and complicated questions are often lost in the flood of easy ones. By calculating the knowledge level of each member, the system could direct questions to the users who are most likely to answer them.

The second problem relates to recognizing the best answer among all received posts. Since forums are flooded with an extensive amount of data, there should be a mechanism for determining the validity of submitted answers so that the questioner can distinguish the valid answers among the received replies. Most forums have a mechanism for representing users' reputation, usually shown with duke stars. For example, in the Oracle Java forum, a member's knowledge level is shown with duke stars: the more duke stars a user has, the higher his or her presumed level of knowledge. One disadvantage of this approach is that its validity depends on users' judgment. Moreover, duke stars cannot represent the fields in which a user has knowledge; a person may, for example, be an expert in servlet programming but a newbie in mobile programming. Therefore, an automatic method for calculating people's knowledge in QA forums based on context is required.

1.1 Motivation

Much research has been conducted to solve the aforementioned problems. It can be classified into two categories: link analysis and information retrieval methods. Methods in both categories have drawbacks, some of which are mentioned in this section. In link analysis methods, the users' social network is extracted from their posts and each user's authority score is then calculated. One of the major problems with link analysis methods is that they do not utilize the content of the posts. Moreover, link analysis methods are not context based, which means they are unable to recognize the expertise level of users in a specific topic.


In addition, link analysis methods cannot detect replies that are unrelated to the asked question (e.g. posts sent for advertisement purposes). There are also a number of information retrieval methods that capture users' expertise automatically; information retrieval methods can be used to find experts because texts contain terms that are relevant to the users' expertise areas. Although information retrieval methods are effective, they cannot calculate each user's influence in the social network, and social cognition research has shown that social influence plays an important role in the perception of expertise [4]. Also, most of the proposed information retrieval methods are unable to recognize synonyms, or even words that are related to the same context, such as Java and Pascal, which are both computer programming languages. Finally, most previous methods in both categories share a common weakness: user-specific information needs are neglected, whereas our proposed method can find experts dynamically based on topic queries. Such problems are the motivation for proposing a novel expert finding technique that is able to process unstructured textual data along with the relations between members. Finding a solution to the aforementioned challenges is the primary goal of this study.

1.2 Contribution

First, the novelty of the proposed method lies in finding semantically relevant posts by employing text mining techniques and the semantic similarity function provided by the WordNet dictionary; this technique determines the relevance of users' posts to the specified context. Another unique feature of the proposed method is that the social network of users in an online forum is extracted and weighted according to the calculated similarity values. Also, by employing a customized link analysis algorithm, the relative importance of each user is calculated with respect to the specified context. The proposed method has an important advantage over prior expert finding methods thanks to its ability to recognize synonymous words using the WordNet dictionary. The experimental results show that the proposed method outperforms approaches that employ link analysis or context analysis techniques alone. Moreover, the context based expert finding approach leads to higher accuracy for expert finding.

This paper is organized as follows. In Sect. 2, the most related works are reviewed. In Sect. 3, our methodology is presented with a step by step explanation of its stages. In Sect. 4, the accuracy of the proposed method is calculated and compared with other methods, and finally the study is rounded off with a conclusion in Sect. 5.


2 Literature review

Expert finding has been an active research topic for the past fifteen years. In the past, most research in this field was conducted to find experts within organizations; the tendency has now shifted toward finding experts on the Internet. Knowledge sharing environments such as forums are helpful tools for sharing knowledge and establishing relations between members. Expert finding methods are categorized into two groups: information retrieval and link analysis.

2.1 Expert finding based on link analysis

Link analysis methods are widely used to find experts in online forums. Expert finding through link analysis consists of constructing a user social network and then applying a ranking algorithm to determine users' authority. Link analysis algorithms such as HITS and PageRank were used to determine experts' knowledge levels in a research project that ranked emails exchanged between IBM's employees. The authors found that link analysis algorithms can produce better results than content analysis methods; however, their study had drawbacks, such as the small size of the network, which could not reflect the characteristics of knowledge relations in real online communities [5, 6]. In 2007, Ackerman and colleagues conducted research to find experts in the Sun Java forum. They pre-processed the extracted posts to create the members' social network and, by leveraging a simulation technique, examined the effects of network structure on the accuracy of their proposed algorithm; they then identified structural features that affect the performance of expert finding algorithms [7]. The QuME engine was proposed to match questioners with responders on the Sun Java forum: it examines members of the forum to find the best answerers for each posted question. This engine has not been evaluated, so there is no evidence that it would work properly in online forums. Moreover, QuME, like the other expert finding algorithms for that forum, only calculates members' expertise in the Java programming language and cannot determine users' levels of expertise in the different sub-areas of the Java concept map [8]. Adamic and her team proposed a novel model for finding the best answers in Yahoo Answers (YA) [9]. YA is an active forum with a great diversity of knowledge being shared. All categories of this forum were studied and then grouped according to the interaction patterns and content properties of their members.


While interactions in some categories resemble expertise sharing online communities, others revolve around everyday advice, discussion, and support. Given such a diversity of categories in which members can take part, the authors discovered that some members focus narrowly on specific topics while others participate across various categories. The entropy of members' activities was analysed in their research, and they found that lower entropy correlates with higher-rated answers; they also predicted which answer would be chosen as the best answer by combining user attributes with reply characteristics [9]. The SNPageRank algorithm was proposed to find experts on social networks: a star schema data warehouse was employed to store the vast amount of data gathered from the FriendFeed website, and the results were compared with experts' opinions using Spearman's correlation [10]. Another study showed that social influence is a key factor in making solutions more broadly accepted [11]. Kameda et al. [12] concluded that "cognitively central members of a community can provide social validation for other members' knowledge, and that, concurrently, their knowledge is confirmed by other members, leading to the perception of well-balanced knowledge or expertise in the focal task domain". Other studies have shown that network centrality is positively related to technical and administrative innovation in organizations; like formal authority, higher network centrality expresses a higher degree of control over, and access to, sensitive data [13, 14].

2.2 Expert finding based on information retrieval

A number of information retrieval methods can capture users' expertise automatically. Balog et al. [15] presented two models based on probabilistic language modelling techniques. In the first model, a textual representation of each user's knowledge is built from the associated documents, and candidates are ranked using this representation. In the second model, all documents are ranked according to a given context, and whether a candidate is an expert is then determined from the associated documents. In 2010, a method was proposed to rank members of the Sun Java forum based on their estimated knowledge, represented as a numeric score between 0 and 1; the method uses the forum's posts to implicitly create a knowledge model for each participant [16]. Abel et al. [17] proposed a rule-based recommender system for online communities, whose recommendations are derived from rules extracted from users' posts and rating scores. By utilizing collaborative techniques, a novel recommender system was proposed by Castro-Herrera et al. [18].


In that system, user profiles are created according to the users' contributions in threads and then exploited to find similarities among users; users' interests are also mined from the main keywords of their posts. A further expert recommender system for online communities was proposed by Žiberna and Vehovar [19]. Balog and Rijke [20] proposed a method that aims at finding similar experts instead of relying on an explicit description of the demanded expertise, and the impact of rich query modelling and non-local evidence on expert finding systems was investigated in another study [21]. In the research conducted by Liu et al. [22], a method based on language models is presented that automatically finds experts in online communities; it was evaluated on large scale real data. The Answer Garden system analyses threads and organizes them into an ontology; by navigating the ontology tree, one can find experts at the leaf nodes. However, such an ontology may not be available, building one is a cumbersome task, and the system needs predefined experts that cannot be changed easily [23, 24]. The Expert/Expert-Locator (EEL) system, proposed by Streeter and Lochbaum [25], answers requests for technical information; it builds a semantic space of organizations and terms using statistical matrix decomposition in order to find term-based semantic similarity in textual data. ContactFinder was developed by Krulwich and Burkey [26] to match bulletin board members with people who have the knowledge required to help them, based on historical data.

2.3 Context similarity algorithms

The context based expert finding algorithm in this research needs a method for computing the similarity between two contexts, so a broad survey of context similarity functions was conducted. The results indicate that there are two categories of methods for computing the similarity between two contexts; these categories, along with their sub-categories, are illustrated in Fig. 1. The first category consists of algorithms that calculate the similarity between contexts solely by examining their grammatical and lexical structures. One of the most famous algorithms in this category is the Levenshtein algorithm [27]. A major disadvantage of the methods in this category is their weakness in finding synonyms. For example, the Levenshtein algorithm treats fridge and refrigerator as entirely different strings, even though the two words are used interchangeably and can be regarded as synonyms, as the sketch below illustrates.
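A minimal Levenshtein distance sketch (our illustration, not code from the study) makes this limitation concrete: the edit distance between fridge and refrigerator is large even though the words are synonyms.

```csharp
using System;

static class Levenshtein
{
    // Classic dynamic-programming edit distance between two strings.
    public static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        // Prints a large distance (8 edits) although the two words mean the same thing.
        Console.WriteLine(Distance("fridge", "refrigerator"));
    }
}
```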


Fig. 1 Context similarity algorithms: the grammatical and lexical approach, and the dictionary based approach with its three sub-categories (based on the ontology, based on the meanings of the words, and the hybrid approach)

Methods in the second category use a dictionary to compute the similarity between different contexts. In contrast with the algorithms in the first category, they consider the semantics of the contexts when determining how similar they are. The algorithms in this category fall into three sub-categories. Some of them employ the WordNet ontology to find the similarity between contexts, such as the OSS function [28]. The ontology tree approach represents the contexts as nodes of a context ontology tree, i.e., in a hierarchical structure. Semantic similarity techniques that use the WordNet ontology tree can be classified into three groups: edge based, node based, and hybrid approaches. The simplest similarity measurement is the edge based approach, in which the distance between two concepts is calculated by counting the edges between them. Resnik [29] subtracted the path length between two concepts from the maximum possible path length in order to compute their similarity. Another popular edge based approach was proposed by Chodorow and Leacock [30], who scaled the shortest path between two concepts by the maximum depth of the hierarchical ontology. In the node based approach, the similarity of two concepts is defined as the ratio between the amounts of information needed to state their commonality [31]. In the hybrid approach, the previous approaches are combined; Jiang and Conrath [32] proposed a hybrid measure that defines the edge strength between two concepts as the difference of information between them. One limitation of these approaches is that the ontology tree may be constructed unevenly: one branch of a node may be split coarsely while another branch is split in much more detail.

Methods in the second sub-category use a text mining approach to find the similarity between words through their meanings in the dictionary. To overcome the limitations of the first sub-category, a novel method is presented in [33] that uses the WordNet English lexical reference system. In this approach, words that occur very frequently in concept explanations in WordNet receive no significant weight, while words that are used only a few times receive higher weights, because frequent words are unlikely to discriminate between concepts. Finally, the similarity between two arbitrary contexts of an ontology tree, which correspond to two different nodes, is computed according to their weighted similarity distance. The WordNet dictionary contains 155,287 words organized into 117,659 synsets, for a total of 206,941 word–sense pairs. It provides an online dictionary that is organized not only alphabetically but also conceptually, showing semantic relationships in terms of similar meanings, part-of relations, and subsumption relations among concepts [34]. Since the WordNet dictionary includes a large number of words and their meanings, the context similarity function proposed in [33] can cover a significant share of the words found in the information shared on online forums. The last sub-category consists of methods that employ the ontology and the word meanings in dictionaries together in order to calculate context similarity.
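As a rough illustration of the edge based measures above, the following sketch scales the shortest path between two concepts by the maximum depth of a tiny hand-made taxonomy, in the spirit of the scaled path measure of Chodorow and Leacock [30]; the taxonomy, the +1 smoothing term, and the maxDepth value are illustrative assumptions, not details taken from the cited work.

```csharp
using System;
using System.Collections.Generic;

static class EdgeBasedSimilarity
{
    // Hypothetical parent links of a tiny is-a taxonomy (not WordNet itself).
    static readonly Dictionary<string, string> Parent = new Dictionary<string, string>
    {
        ["fridge"] = "appliance", ["oven"] = "appliance",
        ["appliance"] = "artifact", ["artifact"] = "entity"
    };

    static List<string> PathToRoot(string c)
    {
        var path = new List<string> { c };
        while (Parent.TryGetValue(c, out var p)) { path.Add(p); c = p; }
        return path;
    }

    // Edge counting: edges from a up to the lowest common ancestor plus edges down to b.
    static int PathLength(string a, string b)
    {
        var pa = PathToRoot(a); var pb = PathToRoot(b);
        for (int i = 0; i < pa.Count; i++)
        {
            int j = pb.IndexOf(pa[i]);
            if (j >= 0) return i + j;
        }
        return pa.Count + pb.Count; // no common ancestor found
    }

    // Scaled shortest path; the +1 avoids log(0) when the two concepts coincide.
    static double Similarity(string a, string b, int maxDepth = 3) =>
        -Math.Log((PathLength(a, b) + 1.0) / (2.0 * maxDepth));

    static void Main() =>
        Console.WriteLine(Similarity("fridge", "oven")); // closer concepts give a higher score
}
```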

3 Methodology

In Fig. 2, the framework of the proposed methodology for expert finding in forums is depicted; this section is therefore organized according to the phases of that methodology.

Fig. 2 Proposed methodology for context based user ranking

3.1 Dataset

Matthew Haughey founded the MetaFilter forum in 1999. The site was programmed by its founder using Microsoft SQL Server and Macromedia ColdFusion, and today this online community is internationally popular among Internet users. Members of MetaFilter can send posts to the site, others may then comment on those posts, and readers can mark other users' comments that they like. In the early years, membership of MetaFilter was free, but after 2004 signups were reopened with a 5 USD life-time fee. The AskMe forum, launched in 2003, is the most successful subsite of MetaFilter. In this forum, members are permitted to send their posts to the online community without the link requirement, and AskMe rapidly grew into a strong side community with slightly different etiquette requirements. Today, threads in this forum cover a broad spectrum of topics. The community has various categories where people can post and comment on different topics, and questions carry assigned tags related to the context of the question. The best answer among all answers to each resolved question in the AskMe forum is chosen by the user who asked that question [35]. In this study, this forum is used as the dataset because it is a well-established online community that has accumulated a high volume of posting–replying data with rich social interactions embedded in it. A web crawler was developed to crawl all pages in the AskMe forum from its establishment in 2003 to the end of 2010, covering approximately eight years of data.

3.2 Crawler

A web crawler is a computer program that downloads the web pages associated with given seed URLs and recursively extracts their hyperlinks.

Web crawlers are used in applications that process and collect large numbers of web pages, such as search engines and web mining tools. In order to extract data from the AskMe forum, a web crawler was designed and developed in the C# language for this forum. Conceptually, the algorithm executed by this crawler is simple. First, it navigates to the archive section of the AskMe forum, where all questions are archived by category. Second, it selects a category from the set of candidates; third, it extracts the hyperlink of each question contained therein; next, it downloads the web pages associated with each question; and finally, it saves them in HTML format, renaming each file to the ID of the visited question. Each question in the AskMe forum has a unique ID: if the ID of a question is appended to the URL of the forum, the question is shown along with all of its assigned tags and answers.
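A sketch of the general shape of such a crawler is shown below; the archive URL, the hyperlink pattern, and the output folder are assumptions made for illustration and do not reproduce the exact program described above.

```csharp
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class AskMeCrawler
{
    static readonly HttpClient Http = new HttpClient();

    static async Task Main()
    {
        Directory.CreateDirectory("pages");

        // Hypothetical archive page for one category; the real crawler iterates all categories.
        string archiveHtml = await Http.GetStringAsync("https://ask.metafilter.com/archived.mefi/category");

        // Assumed link format: question pages look like /123456/some-title.
        foreach (Match m in Regex.Matches(archiveHtml, @"href=""/(\d+)/[^""]*"""))
        {
            string questionId = m.Groups[1].Value;
            string page = await Http.GetStringAsync($"https://ask.metafilter.com/{questionId}/");

            // Save each question page under its unique ID, as described in Sect. 3.2.
            File.WriteAllText(Path.Combine("pages", questionId + ".html"), page);
        }
    }
}
```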

3.3 Extract, transform, and load processes

In order to execute information retrieval queries, the data should be stored in a well-formed structure. First, the data are extracted from the saved files using regular expressions in the C# programming language. Some cleansing and pre-processing tasks are then performed on the extracted data in the transformation phase. Afterwards, the transformed data are loaded into the designed database.
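A minimal sketch of the extraction step, assuming a simplified answer markup (the PostPattern regular expression and the ForumPost record are invented for illustration; the real page structure and database schema may differ):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative record for one extracted post; the actual schema is not shown in the paper.
record ForumPost(int QuestionId, string Author, string Body);

static class PostExtractor
{
    // Assumed markup: each answer sits in a <div class="comments"> block with an author link.
    static readonly Regex PostPattern = new Regex(
        @"<div class=""comments"">(?<body>.*?)posted by <a[^>]*>(?<author>[^<]+)</a>",
        RegexOptions.Singleline);

    public static IEnumerable<ForumPost> Extract(int questionId, string html)
    {
        foreach (Match m in PostPattern.Matches(html))
            yield return new ForumPost(questionId,
                                       m.Groups["author"].Value.Trim(),
                                       StripTags(m.Groups["body"].Value));
    }

    // Crude cleansing step: drop markup before loading the text into the database.
    static string StripTags(string s) => Regex.Replace(s, "<[^>]+>", " ").Trim();
}
```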

3.4 Context based user ranking algorithm

In this section, a novel algorithm called context based user ranking (CUR) is proposed. It adds several features to the PageRank algorithm [36] so that it can be employed for expert finding in online forums. In the proposed algorithm, the nodes are the members of the forum and the links between them represent the questions and replies exchanged. To explain how the CUR algorithm works, an example from an online forum is depicted in Fig. 3.

Fig. 3 Transformation of forum to community expertise network

The members of the forum along with their asked questions are depicted on the left side of Fig. 3, where each member's questions are shown in that member's colour. An arrow from member D to member C's question indicates that D answered C's question. Ackerman and colleagues transformed the Sun Java forum into a community expertise network (CEN), shown on the right side of Fig. 3 [7]. To construct the CEN, an arrow is drawn from the member who asked a question to every person who replied; when one member answers another member's question, it indicates that the answerer has superior knowledge on the topic compared with the questioner. The steps of the CUR algorithm are as follows.

3.4.1 Adjacency matrix calculation

First, the weight of each link in the CEN is calculated based on the number of questions answered for the specific context; in other words, a CEN with different edge weights is created for each context. Imagine, in the previous example, that we need to find experts in the field of computers. The weight of each directed edge in the CEN should then be calculated with the computer context in mind. As shown in formula (3), the weight of each directed edge is calculated from the number of A's questions that B replied to, taking the computer context into consideration. Next, the similarity between every key concept in each post and the computer concept is computed, and the average word similarity is taken as the relevance of the post to the computer context. The similarity calculation is performed using the WordNet similarity function, which employs a text mining approach to compute the similarity between two words. To do so, the similarity between two contexts is computed using the document similarity methods of text processing [33]. In WordNet, if we consider that documents correspond to concepts and that the words of a document correspond to the words of a concept, we can use the document similarity computation for the concept similarity computation.


Let the word frequency wf_{ij} be defined as the number of occurrences of word w_i in the explanation of concept c_j, and the concept frequency cf_i as the number of concepts in which the word occurs within a collection of N concepts. The inverse concept frequency (ICF) factor is then given by \log(N / cf_i). A combined word importance indicator is the product wf \times icf: the importance, or weight, W_{ij}, of the word w_i in the concept c_j is defined as the word frequency multiplied by the inverse concept frequency. That is,

W_{ij} = wf_{ij} \cdot \log \frac{N}{cf_i}    (1)

Through the wf \times icf values, the word–concept matrix, which is the basis for measuring the similarity of a pair of concepts, can be constructed. The pairwise comparison of the matrix gives N(N - 1)/2 different similarity coefficients for the concepts. Finally, the similarity between concepts is measured by the following cosine coefficient:

sim(X, Y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2 \cdot \sum_{i=1}^{N} y_i^2}}    (2)

where x and y are vectors of wf \times icf values and N is the dimension of the vector space. The cosine coefficient takes a value between 0 and 1: the more overlapping words two concepts have, the higher the coefficient, and if there are no overlapping words, the coefficient is zero. This is a brief description of the distance function used in the proposed algorithm [33].
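A compact sketch of formulas (1) and (2), assuming each concept is already represented by the list of words in its dictionary explanation (the sample words and concept frequencies are invented):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ConceptSimilarity
{
    // wf x icf weights for one concept: wf_ij * log(N / cf_i), formula (1).
    static Dictionary<string, double> Weights(string[] words, Dictionary<string, int> cf, int totalConcepts)
    {
        return words.GroupBy(w => w)
                    .ToDictionary(g => g.Key,
                                  g => g.Count() * Math.Log((double)totalConcepts / cf[g.Key]));
    }

    // Cosine coefficient between two weighted word vectors, formula (2).
    static double Cosine(Dictionary<string, double> x, Dictionary<string, double> y)
    {
        double dot = x.Keys.Intersect(y.Keys).Sum(w => x[w] * y[w]);
        double nx = Math.Sqrt(x.Values.Sum(v => v * v));
        double ny = Math.Sqrt(y.Values.Sum(v => v * v));
        return nx == 0 || ny == 0 ? 0 : dot / (nx * ny);
    }

    static void Main()
    {
        // Invented explanation words and concept frequencies, standing in for WordNet glosses.
        var cf = new Dictionary<string, int> { ["keep"] = 40, ["food"] = 12, ["cold"] = 8, ["cook"] = 9 };
        int n = 100; // total number of concepts in the collection

        var fridge = Weights(new[] { "keep", "food", "cold" }, cf, n);
        var oven   = Weights(new[] { "keep", "food", "cook" }, cf, n);

        Console.WriteLine(Cosine(fridge, oven)); // overlapping words raise the coefficient
    }
}
```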

In order to compute the weight of each directed link in the CEN, the relevance of both the question and its answers to the specified context must be considered. Formula (3) is proposed for computing the weight of each edge in the CEN network:

W_{AB} = \sum_{p=1}^{N_{AB}} \left( \frac{\sum_{t=1}^{N_p} Sim(PW_t, C)}{N_p} \cdot \frac{\sum_{t=1}^{N_c} Sim(PC_t, C)}{N_c} \right)    (3)

Table 1 summarizes the notation used in formula (3).

Table 1 Notations used in formula 3

Label     Meaning
N_{AB}    The number of A's questions that were replied to by member B
N_p       The number of keywords of A's question P
C         The context (e.g. Internet)
PW_t      A keyword of A's question P
Sim       The context similarity function used to calculate the distance between the context C and a keyword
N_c       The number of keywords of all of B's answers to A's question P
PC_t      A keyword of B's answers to A's question P
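Under the notation of Table 1, the weight W_AB could be evaluated as in the following sketch; the Sim function and the keyword lists are assumed to be supplied by the WordNet-based similarity step and the keyword extraction described in the text, and each keyword list is assumed to be non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One question asked by A that B replied to: the question's keywords and the keywords
// of all of B's answers to it (already stop-filtered and stemmed).
record AnsweredQuestion(IReadOnlyList<string> QuestionKeywords, IReadOnlyList<string> AnswerKeywords);

static class EdgeWeight
{
    // Formula (3): sum over A's questions answered by B of
    // (mean similarity of question keywords to C) * (mean similarity of answer keywords to C).
    public static double Compute(IEnumerable<AnsweredQuestion> questions,
                                 string context,
                                 Func<string, string, double> sim)
    {
        return questions.Sum(q =>
            q.QuestionKeywords.Average(w => sim(w, context)) *
            q.AnswerKeywords.Average(w => sim(w, context)));
    }
}
```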

As the purpose of this research is to find the expertise level of each member of the online community, the relevance of each question to the specified context must be determined. This is done by calculating the average similarity between all keywords of a question and the specified context. Since the similarity of unrelated questions to the context is negligible, they have no impact on the weight calculation; therefore, irrelevant questions do not inflate a member's estimated level of expertise. In order to distinguish between the respondents to a question, the similarity of the answers to the specified context is also determined: all keywords of B's answers to A's question are extracted and their average similarity to the specified context is calculated. Replies that are more relevant to the context thus have a greater impact on the weight of the edge in the CEN, and spam and advertising posts are effectively ignored. Determining the relevance of questions or answers to the specified context requires the average distance between their keywords and the context, so a method is needed to extract keywords automatically. To do this, standard processing steps were applied, i.e., stop-word filtering and stemming (using the Porter stemmer in C#), and then the weighted average similarity between the keywords and the context was calculated [37]. Having computed the weights of the edges, the CEN adjacency matrix for the example network is shown in Table 2.
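A sketch of this keyword extraction step; the stop-word list is abbreviated and the stemmer below is only a crude placeholder standing in for the Porter stemmer [37].

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordExtractor
{
    // Abbreviated stop-word list, for illustration only.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "an", "is", "are", "of", "to", "and", "in", "it" };

    // Tokenize, drop stop words, and stem the remaining terms.
    public static IEnumerable<string> Extract(string post) =>
        Regex.Matches(post.ToLowerInvariant(), @"[a-z]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .Where(w => !StopWords.Contains(w))
             .Select(PorterStemmer.Stem);
}

static class PorterStemmer
{
    // Crude placeholder, not the real Porter algorithm described in [37].
    public static string Stem(string w) =>
        w.EndsWith("ing") ? w.Substring(0, w.Length - 3) : w;
}
```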

Table 2 Adjacency matrix for the simple CEN

     A              B              C              D              E
A    0              W_AB = 0.321   W_AC = 0.196   W_AD = 0.651   W_AE = 0.804
B    0              0              0              0              0
C    0              0              0              W_CD = 0.406   0
D    W_DA = 0.228   0              0              0              W_DE = 0.745
E    0              0              0              0              0

3.4.2 Transition probability matrix calculation

First, the transition probability matrix is built from the adjacency matrix. If a row of the adjacency matrix contains nothing but zeros, each of its elements is replaced by 1/N. For the other rows, the value of each element is calculated using formula (4):

P_{ij} = \frac{W_{ij}}{\sum_{z=0}^{N} W_{iz}}    (4)

where W_{ij} is the weight of the directed edge between node i and node j, computed using formula (3). The resulting matrix is then multiplied by 1 − α, where α is the probability of the teleport operation in the PageRank algorithm (typically α might be 0.1). Finally, α/N is added to every element of the matrix in order to obtain the transition probability matrix shown in Table 3.

Table 3 Transition probability matrix for the simple CEN (α = 0.1)

     A        B        C        D        E
A    0.02     0.1665   0.1094   0.3171   0.3869
B    0.2      0.2      0.2      0.2      0.2
C    0.02     0.02     0.02     0.92     0.02
D    0.2308   0.02     0.02     0.02     0.7091
E    0.2      0.2      0.2      0.2      0.2
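A sketch of the construction in formula (4) together with the teleport adjustment:

```csharp
static class TransitionMatrix
{
    // Row-normalize the adjacency matrix, treat all-zero rows as uniform rows,
    // then damp with the teleport probability alpha: p = (1 - alpha) * norm + alpha / N.
    public static double[,] Build(double[,] w, double alpha = 0.1)
    {
        int n = w.GetLength(0);
        var p = new double[n, n];
        for (int i = 0; i < n; i++)
        {
            double rowSum = 0;
            for (int j = 0; j < n; j++) rowSum += w[i, j];
            for (int j = 0; j < n; j++)
            {
                double norm = rowSum == 0 ? 1.0 / n : w[i, j] / rowSum;   // dangling row -> 1/N
                p[i, j] = (1 - alpha) * norm + alpha / n;                 // teleport term
            }
        }
        return p;
    }
}
```

Applied to the Table 2 adjacency matrix with α = 0.1, this reproduces the Table 3 values (row A, for instance, becomes 0.02, 0.1665, 0.1094, 0.3171, 0.3869).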

3.4.3 User ranking

In this sub-section, the computation of each member's rank is presented. Suppose the initial probability distribution X_0 is (1, 0, 0, 0, 0); this vector indicates that the random walker is at node A with probability 1 and at the other nodes with probability 0. The probability vector X_1 is computed by multiplying X_0 by the transition probability matrix P:

X_1 = X_0 \cdot P = (1, 0, 0, 0, 0) \cdot P = (0.02, 0.166501, 0.109452, 0.317109, 0.386937)

The probability vector X_2 is calculated by multiplying X_1 by P, and the remaining vectors (X_3, X_4, ..., X_N) are obtained in the same way. Continuing in this fashion, the distribution converges to a steady state (i.e. X_14 = X_15 = X_16 = ... = X_N). X_14 gives the level of each member's expertise in the specified context as a number between 0 and 1. The results of the calculations are shown in Table 4.

Table 4 Computation results

       A          B          C          D          E
X0     1          0          0          0          0
X1     0.02       0.166501   0.109452   0.31711    0.386937
X2     0.186495   0.122549   0.121408   0.224068   0.34548
X3     0.1515     0.131567   0.120928   0.268922   0.327084
X4     0.159271   0.124752   0.116109   0.256404   0.343464
X5     0.158353   0.127612   0.118526   0.256098   0.339411
X6     0.158074   0.127263   0.118229   0.257786   0.338648
X7     0.15823    0.127022   0.118004   0.257236   0.339509
X8     0.158225   0.127156   0.11813    0.257191   0.339298
X9     0.158202   0.127142   0.118115   0.257289   0.339252
X10    0.158212   0.127128   0.118102   0.257258   0.3393
X11    0.158211   0.127135   0.118109   0.257255   0.339289
X12    0.15821    0.127134   0.118109   0.257261   0.339286
X13    0.15821    0.127134   0.118108   0.257259   0.339289
X14    0.15821    0.127134   0.118108   0.257259   0.339288
X15    0.15821    0.127134   0.118108   0.257259   0.339288
X16    0.15821    0.127134   0.118108   0.257259   0.339288
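The ranking step is plain power iteration on P; a sketch, assuming the transition matrix has already been built (for example with the Build method sketched earlier):

```csharp
using System;
using System.Linq;

static class UserRanking
{
    // Iterate x <- x * P until the distribution stops changing; the fixed point is the
    // context based expertise score of each member (Sect. 3.4.3).
    public static double[] SteadyState(double[,] p, double tolerance = 1e-9)
    {
        int n = p.GetLength(0);
        var x = new double[n];
        x[0] = 1.0;                              // start the walk at node A

        while (true)
        {
            var next = new double[n];
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    next[j] += x[i] * p[i, j];   // row vector times matrix

            double change = x.Zip(next, (a, b) => Math.Abs(a - b)).Max();
            x = next;
            if (change < tolerance) return x;
        }
    }
}
```

Starting from X_0 = (1, 0, 0, 0, 0) and the Table 3 matrix, the iteration converges to approximately (0.158, 0.127, 0.118, 0.257, 0.339), matching Table 4.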

4 Evaluation

In this section, the results of the assessments of the CUR algorithm and a comparison with several prior ranking algorithms are discussed. Three different contexts are selected to verify the performance of the CUR algorithm. The evaluation approach is based on finding the best replier among all respondents for each asked question in the test data sets.

4.1 Expertise ranking algorithms

In this section, eight user ranking algorithms are briefly introduced as the criteria for comparison. These algorithms are run against the prepared data set and their outcomes are compared with the results of the CUR algorithm.




One of the simplest measures for finding experts in online forums is AnswerNum, which counts the number of questions replied to by each user: the more questions a user has replied to, the higher his or her presumed level of expertise. Counting how many distinct users a member has helped can be a better indicator than counting the number of his or her answers; in SNA terms, this measure is the indegree of a node. Under this measure, a member who posts fewer answers but helps more people is considered more knowledgeable. Another useful method is Z-number, which is based on the assumption that asking many questions indicates that the questioner lacks expertise on those topics.


Members' replying and asking patterns are combined in the Z-number measure in order to reflect their expertise level. If a member sends a answers and q questions, then the member's Z-number is computed using formula (5):

z = \frac{a - q}{\sqrt{a + q}}    (5)

Z-degree can be calculated by replacing q and a in formula (5) with the number of users a member receives replies from and the number of users he or she replies to, respectively. Obviously, if a member answers and asks equally often (i.e., a = q), the member's expertise level is zero; if the number of asked questions exceeds the number of replies, the Z-number is negative, and vice versa.

ExpertiseRank, a PageRank-like algorithm, was proposed to calculate members' expertise levels in the Java forum [7]. It addresses a potential problem with simply counting the number of people a user helped or the number of answers a user posted: a user who answers the questions of 50 unskilled members would otherwise be ranked as equally expert as one who answers the questions of 50 professional programmers. Moreover, HITS was adopted to calculate users' expertise [7], with the authority value of the HITS algorithm corresponding to the rank of a user. The cluster based document model (CBDM) was proposed to find expert users in question answering forums [22]; CBDM treats the target question as a query and the experts' profiles as documents. The last expertise ranking algorithm is SNPageRank, which was proposed to find experts on the FriendFeed social network [10].

4.2 Finding the best answerer for each asked question

This evaluation is based on finding the best replier among all respondents for each asked question in a forum. In the AskMe forum, the best answerer for each question is selected by its questioner. In this evaluation approach, the best expert finding algorithm is the one that most often detects the best answerer correctly: given the actual best answerer for each question, comparing it with the candidate selected by an algorithm allows the accuracy of that algorithm to be evaluated. The following steps are performed in order to evaluate the proposed method:

1. First, 150 questions for each of three different contexts (i.e. Internet, travel, and music) are selected from the database. The selected questions for each context are checked to make sure that all of them are answered and that each has a selected best answerer. Each context includes three different sets of 50 questions: in the first set, the total number of answers to each question is less than or equal to 5, which gives the ranking algorithms a higher chance of selecting the best replier correctly; in the second set, the number of answers falls in the range between 5 and 10; and in the third set, it is greater than 10.
2. The questions selected in the previous stage are deleted from the training dataset, so they have no impact on the training phase of the ranking algorithms.
3. The expertise level of all members is determined by running the expertise ranking algorithms; each algorithm produces a different ranking of the users.
4. Finally, each ranking algorithm recommends the best answerer for each question based on its ranking results, i.e., the highest ranked replier is suggested as the best answerer.

Two metrics are employed to compare the CUR algorithm with the other algorithms. The first is the accuracy metric, which is computed using formula (6):

Accuracy = \frac{N_1}{N_1 + N_2}    (6)

In formula (6), N_1 is the number of questions for which the expertise ranking algorithm finds the best answerer and N_2 is the number of questions for which it does not. Figure 4 depicts the accuracy of the ranking algorithms in finding the best answerer for the questions whose total number of replies is less than or equal to 5, and Figs. 5 and 6 illustrate their accuracy for the questions in the second and third test data sets, respectively. As shown in Figs. 4 through 6, the CUR algorithm outperforms the prior ranking algorithms in all three contexts in terms of accuracy. Since the number of respondents per question in the third test data set is higher than in the other test data sets, the chance of recommending the correct best answerer is lower there; accordingly, the algorithms obtain their highest accuracy on the first test data set, and their accuracy on the second test data set is higher than on the third. As illustrated in Figs. 4, 5 and 6, the accuracy values of Z-degree, Z-number, AnswerNum, and Indegree are lower than those of the other algorithms, since these measures neglect important information such as the reputation of users in the CEN and the content of the posts. Moreover, the accuracy values of ExpertiseRank, SNPageRank, and HITS are lower than that of the CUR algorithm, since these algorithms are not context based. Also, CUR performs better than CBDM in terms of accuracy because it considers the reputation of users.


Fig. 4 The accuracy of expert finding algorithms for three different contexts (Internet, Music, Travel) for the first test data set

Fig. 5 The accuracy of expert finding algorithms for three different contexts (Internet, Music, Travel) for the second test data set

Fig. 6 The accuracy of expert finding algorithms for three different contexts (Internet, Music, Travel) for the third test data set

The second metric is the mean reciprocal rank (MRR), proposed in [38]. MRR is a statistical measure for evaluating any algorithm that produces a list of responses ordered by the probability of being correct: the reciprocal rank of a query is the inverse of the rank of the first correct response, and the average of all reciprocal ranks is the MRR. In the AskMe forum, a user who finds an answer useful or interesting can assign a like tag to it, so the answers to a question can be ranked by the number of like tags they receive. For each question, the answer chosen as the best answer by the questioner is considered the best answer; the answer with the highest number of like tags is considered the second best answer, and so on. For example, suppose an expert finding algorithm is used to find the best answerer for the questions in Table 5; the ranked list of answerers produced by the algorithm is given in the Results column and the best answer in the Correct response column. The value of the MRR metric for the three samples in Table 5 is 0.61. To compare CUR with the other expert finding methods on the MRR metric, 50 questions for the Internet context are selected from the database and checked to make sure they are answered and have at least 5 answers with assigned like tags. The results of the evaluation are presented in Table 6.
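A short sketch of the MRR computation; each entry pairs an algorithm's ranked list of answerers with the answerer judged correct, the sample values reproduce the three questions of Table 5, and the correct answerer is assumed to always appear in the ranked list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MeanReciprocalRank
{
    // MRR = average over queries of 1 / (1-based rank of the first correct response).
    public static double Compute(IEnumerable<(IList<string> Ranked, string Correct)> queries) =>
        queries.Average(q => 1.0 / (q.Ranked.IndexOf(q.Correct) + 1));

    static void Main()
    {
        var samples = new List<(IList<string>, string)>
        {
            (new List<string> { "A1", "A2", "A3" }, "A3"),   // reciprocal rank 1/3
            (new List<string> { "A1", "A2", "A3" }, "A2"),   // reciprocal rank 1/2
            (new List<string> { "A1", "A2", "A3" }, "A1"),   // reciprocal rank 1
        };
        Console.WriteLine(Compute(samples)); // ~0.61, as in Table 5
    }
}
```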

Table 5 Calculation of MRR metric

Question   Results      Correct response   Rank   Reciprocal rank
Q1         A1, A2, A3   A3                 3      1/3
Q2         A1, A2, A3   A2                 2      1/2
Q3         A1, A2, A3   A1                 1      1

Table 6 Comparing expert finding methods based on the MRR metric

                 Internet   Music   Travel
CUR              38.4       37.86   34.4
Z-degree         20.99      18.93   15.49
Z-number         25.58      24.1    18.12
Answernum        18.14      15.59   11.21
Indegree         17.82      14.93   13.26
ExpertiseRank    26.08      29.21   27.22
SNPageRank       29.41      29.43   31.88
HITS             23.34      33.46   29.61
CBDM             32.32      33.29   27.54

As shown in Table 6, the proposed method outperforms the other expert finding algorithms in all three contexts in terms of the MRR metric. CBDM generally performs better than the remaining expert finding methods apart from CUR, although SNPageRank surpasses it in the Travel context and HITS in the Music and Travel contexts. Z-number also achieves better MRR results than Z-degree, AnswerNum, and Indegree, even though its accuracy values were close to theirs; this is because Z-number is better at identifying the other top answers than those three measures.

5 Conclusion

In this paper, a novel method for expert finding in online forums is proposed. After a comprehensive study of expert finding research areas and trends, a new architecture for context based user ranking (CUR) in online forums is proposed. Special crawling software was designed and developed to gather data from the AskMe forum, and the data were extracted, transformed, and loaded into a database using ETL processes. The CUR algorithm was then proposed to rank the members of the AskMe forum according to their level of knowledge in a specified context. First, the members' relationships are transformed into a CEN. Then the edges between vertices in the CEN are weighted according to the number of sent posts and their relevance to the context, determined using the WordNet similarity function. Finally, the users' ranking is calculated using SNA techniques. In contrast to prior well-known expert finding algorithms, CUR can determine users' expertise levels for a specified context. CUR is also robust against profile injection attacks and other similar malicious threats, because it automatically processes the words used in the users' posts.

For evaluation, CUR was compared with other ranking algorithms on the task of finding the best answerer for each question. As the results show, the CUR algorithm outperforms the other ranking algorithms in terms of both accuracy and MRR. Employing the computed expertise values, a forum can indicate how much a user should trust others' answers; questions relevant to a user's field of expertise can be shown to that user instead of displaying all questions; and the CUR algorithm can be used for the expert finding task, which is an important issue in knowledge management systems.

Acronyms caused problems for keyword extraction, since they are treated as ordinary terms; as future work, it is therefore suggested to employ a domain ontology to enhance the accuracy of the stemming step.


In scientific forums, e.g. a Java programming forum, users use many domain-specific terms in their posts, such as J2EE, servlet, and web service, which do not exist in the WordNet dictionary; context similarity algorithms that employ the WordNet dictionary therefore cannot compute the similarity between such terms. Using a concept map for each scientific domain, the semantic similarity between these terms could be calculated. Furthermore, a recommender system could be designed and developed for QA forums that employs the CUR algorithm for expert finding, in order to intelligently show each question to the users who are most knowledgeable to answer it.

References

1. Constant D, Sproull L, Kiesler S (1996) The kindness of strangers: the usefulness of electronic weak ties for technical advice. Organ Sci 7(2):119–135. doi:10.1287/orsc.7.2.119
2. Wasko M, Faraj S, Teigland R (2004) Collective action and knowledge contribution in electronic networks of practice. J Assoc Inform Syst 5(11):494–513
3. Yimam-Seid D, Kobsa A (2003) Expert finding systems for organizations: problem and domain analysis and the DEMOIR approach. J Organ Comput Electron Commer 13:1–24. doi:10.1207/S15327744JOCE1301_1
4. Romney AK, Weller SC, Batchelder WH (1986) Culture as consensus: a theory of culture and informant accuracy. Am Anthropol 88(2):313–338. doi:10.1525/aa.1986.88.2.02a00020
5. Campbel CS, Maglio PP, Cozzi A, Dom B (2003) Expertise identification using email communications. In: Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, LA, pp 528–531. doi:10.1145/956863.956965
6. Dom B, Eiron I, Cozzi A, Zhang Y (2003) Graph-based ranking algorithms for email expertise analysis. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, New York, NY, pp 42–48. doi:10.1145/882082.882093
7. Zhang J, Ackerman MS, Adamic L (2007) Expertise networks in online communities: structure and algorithms. In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), ACM, pp 221–230. doi:10.1145/1242572.1242603
8. Zhang J, Ackerman MS, Adamic L, Nam KK (2007) QuME: a mechanism to support expertise finding in online help-seeking communities. In: Proceedings of the 20th ACM Symposium on User Interface Software and Technology, Newport, RI, pp 111–114. doi:10.1145/1294211.1294230
9. Adamic LA, Zhang J, Bakshy E, Ackerman MS (2008) Knowledge sharing and Yahoo Answers: everyone knows something. In: Proceedings of the 17th International Conference on World Wide Web (WWW 2008), ACM, pp 665–674. doi:10.1145/1367497.1367587
10. Kardan A, Omidvar A, Farahmandnia F (2011) Expert finding on social network with link analysis approach. In: Proceedings of the 19th Iranian Conference on Electrical Engineering (ICEE), Tehran, Iran, pp 1–6
11. Cross R, Rice RE, Parker A (2001) Information seeking in social context: structural influences and receipt of information benefits. IEEE Trans Syst Man Cybern Part C Appl Rev 31(4):438–448. doi:10.1109/5326.983927
12. Kameda T, Ohtsubo Y, Takezawa M (1997) Centrality in sociocognitive networks and social influence: an illustration in a group decision-making context. J Pers Soc Psychol 73(2):296–309. doi:10.1037/0022-3514.73.2.296
13. Ibarra H (1993) Network centrality, power, and innovation involvement: determinants of technical and administrative roles. Acad Manag J 36(3):471–501. doi:10.2307/256589
14. Burt RS (1982) Toward a structural theory of action. Academic Press, New York, NY
15. Balog K, Azzopardi L, Rijke M (2006) Formal models for expert finding in enterprise corpora. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp 43–50. doi:10.1145/1148170.1148181
16. Kardan A, Garakani M, Bahrani B (2010) A method to automatically construct a user knowledge model in a forum environment. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp 717–718. doi:10.1145/1835449.1835581
17. Abel F, Bittencourt I, Costa E, Henze N, Krause D, Vassileva J (2010) Recommendations in online discussion forums for e-learning systems. IEEE Trans Learn Technol 3(2):165–176. doi:10.1109/TLT.2009.40
18. Castro-Herrera C, Cleland-Huang J, Mobasher B (2009) A recommender system for dynamically evolving online forums. In: Proceedings of the 3rd ACM Conference on Recommender Systems, pp 213–216. doi:10.1145/1639714.1639751
19. Žiberna A, Vehovar V (2009) Using social network to predict the behavior of active members of online communities. In: Proceedings of the International Conference on Advances in Social Network Analysis and Mining, pp 119–124. doi:10.1109/ASONAM.2009.24
20. Balog K, Rijke M (2007) Finding similar experts. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp 23–27. doi:10.1145/1277741.1277926
21. Balog K, Rijke M (2008) Non-local evidence for expert finding. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, pp 489–498. doi:10.1145/1458082.1458148
22. Liu X, Croft WB, Koll M (2005) Finding experts in community-based question-answering services. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp 315–316. doi:10.1145/1099554.1099644
23. Ackerman MS, Malone TW (1990) Answer Garden: a tool for growing organizational memory. In: Proceedings of the ACM SIGOIS and IEEE CS TCOA Conference on Office Information Systems, Cambridge, MA, pp 31–39. doi:10.1145/91474.91485
24. Ackerman MS, McDonald DW (1996) Answer Garden 2: merging organizational memory with collaborative help. In: Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work, Boston, MA, pp 97–105. doi:10.1145/240080.240203
25. Streeter L, Lochbaum K (1988) An expert/expert-locating system based on automatic representation of semantic structure. In: Proceedings of the Fourth Conference on Artificial Intelligence Applications, San Diego, CA, pp 345–350. doi:10.1109/CAIA.1988.196129
26. Krulwich B, Burkey C (1996) The ContactFinder agent: answering bulletin board questions with referrals. In: Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, pp 10–15
27. Heeringa W (2004) Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, Rijksuniversiteit Groningen
28. Schickel-Zuber V, Faltings B (2007) OSS: a semantic similarity function based on hierarchical ontologies. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-07), pp 551–556
29. Resnik P (1995) Using information content to evaluate semantic similarity. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pp 448–453
30. Chodorow M, Leacock C (1997) Combining local context and WordNet similarity for word sense identification. In: Fellbaum C (ed) WordNet: an electronic lexical database, pp 265–283
31. Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning
32. Jiang J, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research in Computational Linguistics (ROCLING X)
33. Morid MA, Omidvar A, Shahriari HR (2011) An enhanced method for computation of similarity between the contexts in trust evaluation using weighted ontology. In: Proceedings of the 10th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp 721–725. doi:10.1109/TrustCom.2011.93
34. Lee S, Huh S, McNiel RD (2008) Automatic generation of concept hierarchies using WordNet. Expert Syst Appl 35(3):1132–1144. doi:10.1016/j.eswa.2007.08.042
35. Silva L, Goel L, Mousavidin E (2009) Exploring the dynamics of blog communities: the case of MetaFilter. Inf Syst J 19:55–81. doi:10.1111/j.1365-2575.2008.00304.x
36. Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project
37. Porter M (2006) An algorithm for suffix stripping. Program: Electron Libr Inf Syst 40(3):211–218. doi:10.1108/00330330610681286
38. Voorhees EM (1999) The TREC-8 question answering track report. In: Proceedings of the 8th Text REtrieval Conference (TREC-8), pp 77–82
