October 2014, Volume 1, Number 4 (pp. 307–317)

http://www.jcomsec.org Journal of Computing and Security

A more Accurate Clustering Method by using Co-author Social Networks for Author Name Disambiguation

Mohammad-Hossein Nadimi-Shahraki a,∗, Mostafa Mosakhani a

a Faculty of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad, Iran.

ARTICLE INFO

Article history:
Received: 19 May 2014
Revised: 11 October 2014
Accepted: 8 November 2014
Published Online: 28 February 2015

Keywords: Author Name Disambiguation, Social Networks, Digital Library, Heuristic Hierarchical Clustering

ABSTRACT

Digital libraries may hold millions of citation records together with bibliographic attributes such as titles, authors’ names, and places of publication. Because their contents are collected from diverse and distinct sources, the use of digital libraries poses several challenges, one of the most important of which is the ambiguity of authors’ names. Although many methods have been proposed for resolving ambiguous author names, their accuracy still needs to be improved. In this paper, an accurate method for author name disambiguation is proposed. It combines heuristic hierarchical clustering with social networks to produce highly accurate clusters. To evaluate the proposed method, an experiment is conducted on a real DBLP dataset. The experimental results show that the proposed method improves accuracy.

© 2014 JComSec. All rights reserved.

1 Introduction

Digital libraries are complex databases that hold rich collections of digital materials and metadata. They offer services such as retrieving the items associated with a particular author, multiple search modes, browsing, personalization, and building communities around particular fields. Digital libraries such as DBLP, arXiv, MEDLINE, Google Scholar, CiteSeer and BDBCOMP have become important sources of information for the academic community because they provide a comprehensive and focused search of related publications [1]. These systems obtain their content from different sources. The problem with collecting information from different sources and storing it in a library is that the information is not standardized, which results in many ambiguities. Among these, ambiguous names are one of the most critical and difficult. A name is a key feature for distinguishing between individuals, but people often make mistakes when distinguishing between authors by name, because an author can have several name variations, several authors can share the same name, and unique identifiers for individuals have not been established yet. Standardizing data obtained from different sources is difficult, and perhaps impossible, because of their differing formats.

∗ Corresponding author. Email addresses: [email protected] (M. Nadimi-Shahraki), m [email protected] (M. Mosakhani). ISSN: 2322-4460

Ambiguous names in digital resources reduce the effectiveness of document retrieval and confuse users, who make mistakes when identifying the citation records of an author’s articles. To solve this problem, the existing citation records in digital libraries should be disambiguated; that is, all records that contain citations from an author with an ambiguous name must be placed in one group, and the results should be presented in clustered form so that a searcher does not encounter ambiguity in identifying the intended author. Ambiguous names are a particular case of the unknown identity problem. In fact, a name cannot be expressed by a single identifier [2, 3]. Researchers in computer science are investigating different ways to resolve unknown identities. Most of the methods proposed in the field of data mining are based on machine learning, information retrieval, artificial intelligence, and social networks.

Citations are key components of digital libraries. They consist of bibliographic attributes such as the authors’ names, the title, and the authors’ affiliations and email addresses [4]. Various algorithms have tried to resolve ambiguous names using these features, yet the problem has not been definitively solved, and none of the existing methods can correctly disambiguate all names. The proposed methods for resolving ambiguous authors’ names fall into three categories: (i) methods that only disambiguate separate names [5], (ii) methods that only disambiguate mixed names [3, 6], and (iii) methods that disambiguate both types of ambiguous names (separate and mixed names) [1, 7, 8]. In general, named entity disambiguation is treated as a clustering problem in which each cluster refers to a specific entity. In this paper, a method for author name disambiguation is proposed in which citation records are clustered using citation features such as authors’ names and titles.

Previous methods have used various techniques to solve the problem of ambiguous names. Some methods [2, 3] use the relationships between authors, which are recognized through co-author information. These relationships provide useful information for identifying an author who has multiple names or several authors who share the same name.
For example, if two ambiguous names have the same co-author, they are considered to be one person. This assumption rests on the observation that, in the real world, there is very little chance that two different authors with the same name have the same co-author [1]. Moreover, researchers work in specific fields and write articles in those fields, so the articles written by an author contain keywords that characterize that author’s field. Some methods therefore use semantic information such as the title, abstract and keywords to tell people with similar names apart.

Among the proposed methods, those based on hierarchical clustering are well suited to this problem. The heuristic hierarchical clustering (HHC) method is one of the best methods [1, 9] and handles both types of ambiguous names. It disambiguates names in a two-step operation. The reported results show that its accuracy is about 21% higher than that of supervised methods [7] and 15.5% higher than that of unsupervised methods [10]. However, it shares the disadvantages of other clustering methods for resolving ambiguous names. One of them is that, in the first stage of clustering, similarity functions such as the Jaccard coefficient and the Fragments Comparison algorithm are used to compare co-authors’ names when combining records. If the similarity functions consider the co-authors’ names of two ambiguous names to be the same, the records are combined. In some cases, however, co-authors’ names that are similar to each other do not belong to one person, and the two records are combined into a cluster by mistake. Such a wrong combination causes further false records to be added to the cluster in the later steps of hierarchical clustering. For example, the two names B. Wang and Biing Feng Wang are recognized as similar by the Jaccard coefficient and are assumed to be one author.

In this paper, a hybrid method that uses social network analysis together with hierarchical clustering is employed to overcome this weakness of similarity functions. A social network makes it possible to assess the co-authorship feature according to the authors’ real-world relationships, so that the common mistakes of similarity functions can be avoided. Here, a new social network built from the co-authors of ambiguous names is presented. Using co-author social networks along with similarity functions can reduce errors and increase the accuracy of the clustering results.

2 Research Literature

Disambiguating authors’ names is an instance of named entity disambiguation [11], which is defined as identifying the various references to an entity in textual documents. Ambiguous author names appear in two forms in digital sources: separate names and mixed names.

Separate names: an entity name may be written in different ways. In other words, the name of a person may be written in several relatively similar forms in different books and articles. In this case, not all of the documents related to an author may be retrieved. For instance, the name “Tom Mitchell” from Carnegie Mellon University may be written as “Tom Mitchel” or “Tom M. Mitchel” in his articles. Searching for these names in a library such as DBLP yields different results, and not all works by this author are shown. This example is depicted in Figure 1.

Mixed names: unlike the previous case, when two authors’ names have the same spelling, their records are combined and displayed together. Such names lead to the retrieval of non-relevant documents, which confuses users and causes them to make mistakes. In this case, disambiguation methods separate the authors’ citations. This issue is illustrated in Figure 2, where the name “Mohammed Zaki” refers both to Mohammed Zaki from Al-Azhar University in Nasr City and to Mohammed Zaki from Rensselaer Polytechnic Institute in the United States. The records of the two authors appear together under the same name.

In the literature, author name disambiguation methods are usually based on supervised or unsupervised techniques. Unsupervised methods use similarity functions to evaluate the similarity between features in order to group the records related to an author; examples can be found in [1, 3, 4, 6, 9, 10, 12, 13]. In contrast, supervised methods use training sets of pre-labeled samples to predict the author of a citation or to determine whether two citations belong to the same author [7, 8, 14, 15].

Cota et al. introduced a method that disambiguates both separate and mixed names [1]. They proposed an efficient clustering method consisting of two main steps. In the first step, only records that have a similar co-author are combined. In the second step, the HHC method is applied using additional information from the records within the clusters, namely the title and the place of publication. To combine clusters, the titles and places of publication of the articles are first compared using similarity functions (TF-IDF and cosine); if their similarity is greater than a threshold, the two clusters are combined. The DBLP and BDBCOMP libraries and the K and F1 metrics were used to evaluate this method. On average, it achieves an accuracy of 77%, and experiments show that its accuracy is about 12% higher than that of supervised and unsupervised methods.

In 2010, Levin et al. used the results of social network analysis for name disambiguation [16]. Their experiments showed that social network analysis improves the quality of the results. Social network analysis provides evidence of how closely authors are related across articles. In a network, the relationship, the distance, and the strength of the relationship are measured. Distance is an important measure: if the distance between two ambiguous authors in the network is smaller than a threshold, they are probably a single person, and if it is greater than the threshold, they are assumed to be two different authors. Generally, if the distance between two names is two or less, they are taken to be the same name. In addition, if there are many paths between two authors, the relationship between them is stronger, so it is assumed that they are a single person in the real world. A social network makes it possible to assess the co-authorship feature according to real-world relationships and therefore reduces the errors that occur with similarity functions. However, if there is no relationship between the nodes in the network, no name disambiguation is performed.
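The distance rule described above can be sketched as a breadth-first search over a co-authorship graph. The sketch below is only an illustration under stated assumptions: the graph representation, the function names (build_graph, shortest_distance, probably_same_person) and the toy edge list are hypothetical and are not taken from [16]. Unreachable pairs are simply reported as not matching here, whereas the text states that no decision is made in that case.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_distance(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def probably_same_person(graph, a, b, max_distance=2):
    """Assumed decision rule: same person if the distance is at most max_distance."""
    dist = shortest_distance(graph, a, b)
    return dist is not None and dist <= max_distance


# Hypothetical toy network: two name variants share the co-author "A. Gupta".
edges = [
    ("Riccardo Bettati", "A. Gupta"),
    ("R. Bettati", "A. Gupta"),
    ("R. Bettati", "M. Moran"),
    ("X. Author", "Y. Author"),
]
graph = build_graph(edges)
```

With this toy graph, the two Bettati variants are two hops apart (through the shared co-author), so the threshold of two accepts them, while a pair three hops apart or in a disconnected component is rejected.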

3 Proposed Solution

3.1 Proposed model

In this section, a model is proposed for resolving both types of ambiguous names (separate and mixed names). It combines heuristic hierarchical clustering with a social network: the results of social network analysis are used along with the similarity function and heuristic hierarchical clustering to reduce the errors caused by similarity functions. Similarity functions operate on textual similarity only, and they fail in some cases. By building the authors’ network and measuring the actual relationships between the co-authors of an ambiguous name, such errors can be avoided.

The overall implementation of the proposed model is shown in Figure 3. First, all records belonging to authors with the same name are extracted from the databases of digital libraries. Next comes the preprocessing step, which removes redundant words from the title and the place of publication and applies the blocking module. This module groups the ambiguous candidates into separate classes. Then, a hierarchical clustering method is used along with a social network function to cluster these ambiguous groups. In the first step of this method, only records that have similar ambiguous authors’ names and the same co-author are combined, so the co-authors of ambiguous names are compared pairwise. If both the Jaccard similarity function and the social network function confirm that the co-authors are the same, the records are combined. The operation of the social network function is explained later. In the second step, the titles and places of publication of the articles and the authors’ affiliations are examined, and if they are similar, the clusters generated in the first step are combined. After the two clustering steps, the resulting clusters are stored in the database.

The social network function receives the two authors’ names that the Jaccard similarity function recognized as the same. After checking the distance between them, it returns a positive result if the two names are close and similar in the network. To illustrate the operation of this function, consider the following example: the ambiguous name A.Gupta with two different co-authors, Riccardo Bettati and R.Bettati. A similarity function such as the Jaccard coefficient recognizes both co-authors as the same author. These two

Figure 1. An author with several similar names

Figure 2. Several authors with similar name

Figure 3. Proposed Model (blocks: Database; Extracted records r1, r2, r3, ..., rn; Preprocessing phase; First step and Second step of HHC+; Social network function; Clustering result {C1, C2, C3, ..., Cn})

Figure 4. Social network of Riccardo Bettati and R.Bettati co-authors (paper nodes P1, P2 and P3; author nodes labeled A Gupta, Riccardo Bettati, R Bettati, Q. Nguyen, D. Ferrari, Oppliger Rolf, W. Heffner and Mark Moran)

records, with their other features, are shown as follows.

Riccardo Bettati I c d g intern distribute compute system dynamo resource all migrate multi part real time common a Gupta

Moraine: rbettati: roppliger I d m interact distribute multimedia system telecommun service secure architecture tenet scheme a Gupta

Riccardo Bettati and R.Bettati are given to the social network function, which searches the database for these two names in order to obtain other records for a more accurate examination. The following record is extracted from the database because it contains a similar name, i.e. R.Bettati.

Ferrari: mark moraine: qnguyen: rbettati: wheffner nossdav network open system connect establish multi part real time common a Gupta

The function checks whether the two names Riccardo Bettati and R.Bettati belong to the same author. From the authors of these three records, the social network function creates the graph shown in Figure 4, in which P1 is the paper titled “Dynamic resource migration for multi-party real-time communication”, P2 is the paper titled “A security Architecture for Tenet Scheme 2” and P3 is the paper titled “Connection Establishment for Multi-Party Real-Time Communication”. In general, the social network function returns true or false according to the distance between two nodes.

In Figure 4, authors and papers are shown by ovals and squares, respectively. There are two kinds of links. A solid line connects an author to a paper he or she wrote; an author who has written several articles therefore appears in the graph several times. A dotted line connects two author nodes that carry the same name, and authors connected by dotted lines represent the same person. Here we want to determine whether the two names Riccardo Bettati and R.Bettati from papers P1 and P2 belong to the same person. As Figure 4 shows, the two names have great textual similarity, and there is a path connecting P1 and P2 through nodes P2.1 and P1.2, so they are also socially related. If, in contrast, two names are only textually related and there is no other relationship between them, it is not possible to say definitely that they are the same person. Also, if the distance between similar nodes is small, they are

Figure 5. Heuristic hierarchical clustering [1]. The figure shows four records of the ambiguous group A.Gupta and the clusters to which they are assigned:

CLUSTER 1: A Gupta, Funka-Lea, "The use of hybrid models to recover cardiac wall motion in tagged MR images." Computer Vision and Pattern Recognition, 1996. Proceedings CVPR'96, 1996 IEEE Computer Society Conference on. IEEE, 1996.

CLUSTER 2: A Gupta, Oppliger, Rolf, Moran, M., & Bettati, R. (1996). A security architecture for Tenet Scheme 2 (pp. 163-174). Springer Berlin Heidelberg.

CLUSTER 2: A Gupta, Bettati, Riccardo, "Dynamic resource migration for multiparty real-time communication." Distributed Computing Systems, 1996., Proceedings of the 16th International Conference on. IEEE, 1996.

CLUSTER 3: A Gupta, Kurt Rothermel, "Failure recovery for multi-party real-time communication." Multimedia Computing and Systems, 1996., Proceedings of the Third IEEE International Conference on. IEEE, 1996.

assumed to be the same. As noted above, the names R.Bettati and Riccardo Bettati are searched for in the database to extract other articles for a more exact examination, and the third paper, P3, is found as a new record. In this paper, nodes P3.1 and P3.6 are similar to nodes P1.2 and P2.4. Riccardo Bettati thus has a common co-author in all three articles (P1.2, P3.1, and P2.1) and a further common co-author in the two papers P2 and P3 (P3.6 and P2.4). Therefore, P1.1, P3.3 and P2.3 are likely to be the same person. To increase the accuracy of the final results, we treat two nodes with the same label and a distance of 2 or less as a single person. Based on this model, we present a method that we call HHC+, which is explained in the following.

3.2 The Proposed Method

As stated before, our method, like the HHC method, performs name disambiguation in two stages after the preprocessing phase. The groups contain the ambiguous names in the list. In Figure 5, four records of the ambiguous group A.Gupta are shown. After the candidates for clustering have been identified, the first step combines only those records of ambiguous authors that have a similar co-author. The second and third records in the figure are put together in one cluster because they have similar co-authors’ names (Riccardo Bettati and R.Bettati). This heuristic is based on the assumption that it is rare for two authors with similar names to have a common co-author [1]; that is, two similar names with the same co-author are assumed to refer to a single author. Records of a group that do not share any common characteristics are not clustered together, which reduces the chance of combining the publications of different authors. This example is taken from Cota et al.

Our method also consists of two steps, with changes made to the first step. Algorithm 1 describes its overall operation. It receives a list R containing the records of papers and returns a list G of authors’ clusters. Line 1 is the preprocessing step, and line 2 creates an empty list of clusters. Our method then operates in two main steps (the first step, lines 3 to 16, and the second step, lines 17 to 35) to generate a list of clusters from the articles’ records. The author’s name in each record (a) is compared with the name of the author of the first record in each cluster (C) using the similarity and social network functions. If the name of the author in (a) is similar to the name of the author of the first record in (C), a co-author’s name from (a) is similar to some co-author’s name in (C), and the social network function has returned true, the record (a) is inserted into cluster (C) (line 8); otherwise a new cluster is created from record (a) (line 13) and added to list G (line 14). The social network function is called in line 7 and if

Algorithm 1 HHC+
Input: List R of citation records
Output: List G of clusters of authorship records
 1: A ← PreprocessCitationRecords(R)
 2: G ← empty list of clusters
 3: for all a in A do
 4:   inserted ← false
 5:   C ← first(G)
 6:   while not inserted do
 7:     if the author name from a is similar to the author name from the first authorship record of C and there is a coauthor name in a that is similar to some coauthor name in C and SocialNetwork(coauthor name in a, similar coauthor name in C) then
 8:       InsertAuthorshipRecord(a, C)
 9:       inserted ← true
10:     end if
11:   end while
12:   if inserted = false then
13:     C ← CreateNewCluster(a)
14:     Append(G, C)
15:   end if
16: end for
17: fused ← true
18: while fused do
19:   fused ← false
20:   for all c1 in G do
21:     for all c2 in G do
22:       if c1 ≠ c2 and the first author name from c1 is similar to the first author name from c2 then
23:         tt1 ← GetWorkTitleTerms(c1)
24:         tt2 ← GetWorkTitleTerms(c2)
25:         tv1 ← GetPublicationVenueTitleTerms(c1)
26:         tv2 ← GetPublicationVenueTitleTerms(c2)
27:         if TitleSimilarity(tt1, tt2) > title_threshold or VenueSimilarity(tv1, tv2) > venue_threshold then
28:           c1 ← Fuse(c1, c2)
29:           remove(G, c2)
30:           fused ← true
31:         end if
32:       end if
33:     end for
34:   end for
35: end while

the names of the co-authors belong to one person according to the network’s measurements, the function returns a positive result and the next steps are carried out to combine the records. In the following, we present an algorithm that performs the measurements between two nodes. In Algorithm 2, the sets r and c consist of the co-authors’ names and the similar co-authors’ names, respectively. The algorithm receives two co-authors of ambiguous names and returns true if the two names refer to the same person. First, the database is searched for the names of the co-authors (line 2) to obtain more records for a more accurate examination. A social network of authors’ names is constructed from the obtained names (line 4). Then, each node whose label is similar to the entered co-authors’ names is marked (line 5). For example, in Figure 4, nodes P1.1, P3.3 and P2.3 are marked for checking. Next, the distances between the marked nodes are checked pairwise. If the distance between two nodes in the created network is greater than 2, the two names are considered to belong to two persons and false is returned (lines 7 and 8). If the distance between all co-authors is two or less, true is returned. In this case, if two co-authors of an ambiguous name are considered as one person by the social network,

Algorithm 2 Social network function
Input: r, c
Output: true or false
 1: i = 1, R = ∅, E = ∅
 2: En ← search the database for the similar names (coauthor name in r, similar coauthor name in c)
 3: R ← E1, E2
 4: create the social network from Rn, r and c
 5: I ← find the nodes whose labels are similar to the names in r and c
 6: for all nodes Ii and Ii+1 do
 7:   if RelationshipDistance(Ii, Ii+1) > 2 then
 8:     return false
 9:   end if
10: end for
11: return true

then the two records are placed together in a cluster.

The second step reduces fragmented clusters by using the title and place-of-publication features. It combines clusters hierarchically, and these combinations are based on the content of the clusters. Clusters are compared pairwise in Algorithm 1 (line 22); if the authors’ names in two clusters are the same and the records of the two clusters have similar titles or places of publication, the clusters are combined. To combine clusters, the words of the titles and places of publication are first weighted using TF-IDF vectors, and the clusters are then compared using the cosine similarity function. If the similarity exceeds the threshold, the clusters are combined (lines 20 to 34). This process continues until no further clusters can be combined. The result is a list of clusters (G) with their related records, where each cluster contains the records of one author with an ambiguous name. Whenever two or more similar clusters are combined, richer information becomes available for the comparisons in later stages. For example, combining two records of an author yields more names and more words from titles and places of publication. This information, which grows after each combination, helps to increase the similarity between clusters [1].

3.3 Similarity measure functions

Similarity functions evaluate the similarity between two words or two fields. There are several such functions: some, like Jaccard similarity, are used to evaluate the similarity between two words, and others, such as cosine similarity, are used to evaluate the similarity between two sentences. In our implementation, the Jaccard similarity coefficient is used to check the co-authorship feature, and cosine similarity is used to check the title and place of publication. The Jaccard coefficient between two citations $A_1$ and $A_2$ is defined as the size of the intersection over the size of the union of $A_1$ and $A_2$, with $T_{A_1}$ and $T_{A_2}$ denoting the corresponding sets. It is given in Equation (1):

\[ j(A_1, A_2) = \frac{|T_{A_1} \cap T_{A_2}|}{|T_{A_1} \cup T_{A_2}|} \tag{1} \]
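Equation (1) can be illustrated with a few lines of Python. This is a minimal sketch, assuming that each citation is represented by a set of normalized co-author tokens; the function name jaccard, the example tokens and the convention of returning 0 for two empty sets are assumptions made for the example, not details given in the paper.

```python
def jaccard(terms_a, terms_b):
    """Size of the intersection over the size of the union of two term sets."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # Assumed convention: two empty sets share no evidence, so return 0.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical co-author tokens of two citation records.
record_1 = {"rbettati", "mmoran", "roppliger"}
record_2 = {"rbettati", "mmoran", "dferrari", "qnguyen", "wheffner"}

similarity = jaccard(record_1, record_2)  # 2 shared tokens out of 6 distinct ones
```

In this example two of the six distinct tokens are shared, so the coefficient is 1/3; whether that counts as "similar" depends on a threshold that the sketch deliberately leaves to the caller.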

In the cosine similarity function, each word in the title or the place of publication of a paper is treated as a term, and the cosine between two clusters is calculated over feature vectors in which each feature is the TF-IDF value [17] of the corresponding word. The cosine similarity function is shown in Equation (2):

\[ \cos(C_i, C_j) = \frac{\sum_{k} c_{ik} \, c_{jk}}{|c_i| \, |c_j|} \tag{2} \]

In Equation (2), $c_i$ and $c_j$ are the feature vectors of the terms in the work titles (or in the publication venue titles) of the clusters $C_i$ and $C_j$; $|c|$ corresponds to the norm of the vector $c$; and $c_{ik}$ and $c_{jk}$ correspond to the value of the $k$-th feature in the vectors $c_i$ and $c_j$, respectively.
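The cosine comparison of Equation (2), together with the fusion test in line 27 of Algorithm 1, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses raw term counts and idf = log(N/df), which is one common TF-IDF variant (the paper does not say which weighting was used), and the function names, example titles and threshold values are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / document frequency).
        vectors.append({t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine of two sparse vectors; 0.0 if either vector has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def should_fuse(title_sim, venue_sim, title_threshold, venue_threshold):
    """Fusion test as in line 27 of Algorithm 1: either similarity exceeds its threshold."""
    return title_sim > title_threshold or venue_sim > venue_threshold


# Hypothetical preprocessed title terms of three clusters.
titles = [
    "dynamic resource migration multi party real time communication".split(),
    "connection establishment multi party real time communication".split(),
    "hybrid models cardiac wall motion tagged mr images".split(),
]
v1, v2, v3 = tfidf_vectors(titles)
sim_12 = cosine(v1, v2)  # titles share several terms, so the cosine is positive
sim_13 = cosine(v1, v3)  # titles share no term, so the cosine is zero
```

The two networking titles share terms and obtain a positive cosine, while the medical-imaging title shares none and obtains zero, so only the first pair could pass a title threshold.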

4 Experimental Evaluation

Experiments are performed separately for both groups of ambiguous names, using the author’s name, the list of co-authors’ names, the title and the place of publication as features. For each ambiguous name, the program is run ten times, and the results of each run are recorded to compute their mean. Finally, the means over all runs are obtained for all ambiguous groups. Our implementation was written in the C# language (version 2012) and run on an i5 system with a 2.67 GHz processor and 3 GB of RAM.


Table 1. The DBLP dataset

Name          # Authors   # Records
A. Gupta      26          576
A. Kumar      14          243
C. Chen       60          798
D. Johnson    15          368
J. Martin     16          112
J. Robinson   12          171
J. Smith      29          921
K. Tanaka     10          280
M. Brown      10          153
M. Jones      13          260
M. Miller     12          405
TOTAL         220         4287

4.1 The Dataset

The dataset is taken from the DBLP digital library and contains several ambiguous groups. It comprises 4287 records attributed to 220 distinct authors, that is, approximately 20 records per author on average. In 2270 of these records (53% of all records), the authors’ names are in a short format. The original collection was created in [7] and has been used with minor modifications in several other works [7, 10, 13, 18]. The authors’ records were labeled manually using external sources of information such as the authors’ publication pages, affiliations and email addresses, so the records in the dataset are already disambiguated and clustered. For some unclear cases, e-mails were sent to the authors to resolve the open questions about their works. Records with insufficient information were removed from the set. In [7], abbreviated names of the place of publication were replaced by their full names. In our experiments, the original set is used in a limited way, with some adjustments.

Table 1 shows more details of the DBLP dataset. As its first row shows, the name A.Gupta has 576 records belonging to 26 people. In general, any method that produces a number of clusters close to the number of people is suitable and useful. Among these names, C. Chen is hard to disambiguate because its records belong to 60 different people. Names such as J. Robinson are easier to disambiguate, and more accurate results are produced, because the number of individuals is smaller and the majority of authors have only one record.

4.2 Evaluation criterion

K metric: The K metric [19] is used for evaluating the quality of the generated clusters. It is the geometric mean of the average cluster purity (ACP) and the average author purity (AAP). ACP evaluates the purity of the generated clusters with respect to the reference clusters. If the generated clusters are pure, its value is 1. ACP is defined as follows:

$$\mathrm{ACP} = \frac{1}{N} \sum_{i=1}^{q} \sum_{j=1}^{R} \frac{n_{ij}^{2}}{n_{i}} \tag{3}$$

AAP evaluates how fragmented the automatically generated clusters are with respect to the reference clusters. Its value lies between zero and one, and the less fragmented the generated clusters are, the closer it is to one. AAP is defined as follows:

$$\mathrm{AAP} = \frac{1}{N} \sum_{j=1}^{R} \sum_{i=1}^{q} \frac{n_{ij}^{2}}{n_{j}} \tag{4}$$

In Equations (3) and (4), $R$ is the number of clusters produced manually (reference clusters), $N$ is the total number of records in the ambiguous group, and $q$ is the number of clusters generated automatically by the method. $n_{ij}$ is the number of elements of the automatically generated cluster $i$ that belong to the reference cluster $j$, $n_{i}$ is the total number of elements in the automatically generated cluster $i$, and $n_{j}$ is the total number of elements in the reference cluster $j$. The K metric is obtained by the following formula:

$$K = \sqrt{\mathrm{ACP} \times \mathrm{AAP}} \tag{5}$$
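Equations (3) to (5) can be illustrated with a short Python sketch. It is not the authors' C# implementation. The function name `k_metric` and the representation of a clustering as a list of labels (one per record) are assumptions made for this example. AAP here divides by the size of the reference cluster, which is the usual definition of the K metric.

```python
from collections import Counter
from math import sqrt


def k_metric(predicted, reference):
    """Return (ACP, AAP, K) for two parallel lists of cluster labels.

    predicted[r] is the automatically generated cluster of record r and
    reference[r] is its manually assigned (reference) cluster.
    Hypothetical helper written for illustration only.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(predicted)                          # N: total number of records
    n_i = Counter(predicted)                    # size of each generated cluster i
    n_j = Counter(reference)                    # size of each reference cluster j
    n_ij = Counter(zip(predicted, reference))   # overlap of cluster i with cluster j

    # Pairs (i, j) with no overlap contribute zero, so iterating over the
    # non-empty overlaps is equivalent to the double sums in (3) and (4).
    acp = sum(c * c / n_i[i] for (i, _), c in n_ij.items()) / n
    aap = sum(c * c / n_j[j] for (_, j), c in n_ij.items()) / n
    return acp, aap, sqrt(acp * aap)


# Four records: the method builds clusters {r0, r1} and {r2, r3},
# while the reference clusters are {r0, r1, r2} and {r3}.
acp, aap, k = k_metric([0, 0, 1, 1], ["a", "a", "a", "b"])
print(round(acp, 2), round(aap, 2), round(k, 2))  # 0.75 0.67 0.71
```

In this toy case the second generated cluster mixes two authors, which lowers ACP, and the first reference author is split over two generated clusters, which lowers AAP. A perfect clustering gives 1 for all three values.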

4.3 Baselines

We compare our method against the unsupervised Heuristic-Based Hierarchical Clustering (HHC) method of Cota et al., which is described in Section 2. To implement the HHC method, we removed the social network function from our method and made minor changes to it.


Table 2. Experimental results on DBLP dataset.

Name          AAP    ACP    K
A. Gupta      0.71   0.91   0.80
A. Kumar      0.70   0.81   0.72
C. Chen       0.49   0.62   0.55
D. Johnson    0.55   0.89   0.69
J. Martin     0.76   0.95   0.84
J. Robinson   0.73   0.94   0.82
J. Smith      0.73   0.87   0.79
K. Tanaka     0.64   0.94   0.77
M. Brown      0.74   0.88   0.80
M. Jones      0.78   0.96   0.86
M. Miller     0.79   0.96   0.87
TOTAL         0.69   0.87   0.77

4.4 Experiments

In all experiments on the DBLP dataset, the similarity thresholds for publication venue titles and work titles are set to 0.2 and 0.4, respectively, because these values are reported in [1] to produce the best results. Table 2 shows the average results over ten runs for each ambiguous group in DBLP using our hybrid method (HHC+). As can be seen, the highest cluster purity (ACP), 96%, is obtained for the last two names in the table, and the lowest, 62%, is obtained for the East Asian name C. Chen. On average, the cluster purity (ACP) was 87% and the author purity (AAP), which reflects cluster fragmentation, was 69%.

To compare our method against the HHC method, HHC was run under the same conditions as our method. Since we use the same dataset, the same evaluation measures, and similar experimental conditions, the values are directly comparable. The experiments were performed twice: once using only the list of co-author names and once using all attributes. Because the two methods operate very similarly, the results are fairly close. As Table 3 shows, HHC+ obtains higher purity and lower fragmentation than HHC, both when using only the co-author names and when using all attributes. The improvement becomes more visible as the number of ambiguous names and the number of their co-authors' names increase. The proposed method resolves the common error that occurs in the first step of the HHC method. In our method, two authors of the same ambiguous group, C. Chen, were recognized as two separate authors and placed in two correct, separate clusters by the social network function.

Table 3. Experimental results on DBLP dataset with various attributes.

                             HHC+                  HHC
Attributes                   ACP    AAP    K       ACP    AAP    K
Using the coauthor names     0.97   0.44   0.70    0.91   0.37   0.64
Using all attributes         0.87   0.69   0.77    0.85   0.63   0.74

The social network function found that the two names Biing Feng Wang and B. Wang belong to two different authors, so it returned false and the wrong merge was prevented. In the same way, other wrong merges are avoided in the first step and clusters are created with high accuracy. Creating accurate clusters in the first step leads to correct clusters in the subsequent steps and thus increases the accuracy of the final results.

5 Discussion and Conclusion

In this paper, we proposed a hybrid method for author name disambiguation in digital libraries. The proposed method mainly aims to solve both types of name ambiguity (separate and mixed). In this method, hierarchical clustering is combined with a social network in order to prevent probable errors of the similarity function in the first stage of hierarchical clustering and to generate clusters by measuring real relationships between authors. The experimental results show that using the social network of co-authors improves the similarity function and can therefore enhance the accuracy of clustering. Using the proposed method, an accuracy of 96% was obtained for some names and an average accuracy of 87% was obtained over the ambiguous names of DBLP.

References

[1] Ricardo G Cota, Anderson A Ferreira, Cristiano Nascimento, Marcos André Gonçalves, and Alberto HF Laender. An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9):1853–1870, 2010.
[2] Dongwook Shin, Taehwan Kim, Hana Jung, and Joongmin Choi. Automatic method for author name disambiguation using social networks. In Advanced Information Networking and Applications (AINA), 2010 24th IEEE International Conference on, pages 1263–1270. IEEE, 2010.
[3] Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, and Bing Lv. On graph-based name disambiguation. Journal of Data and Information Quality (JDIQ), 2(2):10, 2011.
[4] Byung-Won On and Dongwon Lee. Scalable name disambiguation using multi-level graph partition. In SDM, pages 575–580. SIAM, 2007.
[5] Byung-Won On, Ergin Elmacioglu, Dongwon Lee, Jaewoo Kang, and Jian Pei. Improving grouped-entity resolution using quasi-cliques. In Data Mining, 2006. ICDM'06. Sixth International Conference on, pages 1008–1015. IEEE, 2006.
[6] In-Su Kang, Seung-Hoon Na, Seungwoo Lee, Hanmin Jung, Pyung Kim, Won-Kyung Sung, and Jong-Hyeok Lee. On co-authorship for author disambiguation. Information Processing & Management, 45(1):84–97, 2009.
[7] Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In Digital Libraries, 2004. Proceedings of the 2004 Joint ACM/IEEE Conference on, pages 296–305. IEEE, 2004.
[8] Jian Huang, Seyda Ertekin, and C Lee Giles. Efficient name disambiguation for large-scale databases. In Knowledge Discovery in Databases: PKDD 2006, pages 536–544. Springer, 2006.
[9] Ricardo G Cota, Marcos André Gonçalves, and Alberto HF Laender. A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries. In SBBD, pages 20–34. Citeseer, 2007.
[10] Hui Han, Hongyuan Zha, and C Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In Digital Libraries, 2005. JCDL'05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on, pages 334–343. IEEE, 2005.
[11] Lê Diệu Thu. Named entity disambiguation in digital libraries. 2010.
[12] Indrajit Bhattacharya and Lise Getoor. A latent Dirichlet model for unsupervised entity resolution. In SDM, volume 5, page 59. SIAM, 2006.
[13] Kai-Hsiang Yang, Hsin-Tsung Peng, Jian-Yi Jiang, Hahn-Ming Lee, and Jan-Ming Ho. Author name disambiguation for citations using topic and web correlation. In Research and Advanced Technology for Digital Libraries, pages 185–196. Springer, 2008.
[14] Vetle I Torvik and Neil R Smalheiser. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3):11, 2009.
[15] Anderson A Ferreira, Adriano Veloso, Marcos André Gonçalves, and Alberto HF Laender. Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 10th annual joint conference on Digital libraries, pages 39–48. ACM, 2010.
[16] Felipe Hoppe Levin and Carlos A Heuser. Evaluating the use of social networks in author name disambiguation in digital libraries. Journal of Information and Data Management, 1(2):183, 2010.
[17] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM Press, New York, 1999.
[18] Denilson Alves Pereira, Berthier Ribeiro-Neto, Nivio Ziviani, Alberto HF Laender, Marcos André Gonçalves, and Anderson A Ferreira. Using web information for author name disambiguation. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, pages 49–58. ACM, 2009.
[19] Itshak Lapidot. Self-organizing-maps with BIC for speaker clustering. 2002.

Mohammad-Hossein Nadimi-Shahraki was born in Iran. He received his PhD in computer science from Universiti Putra Malaysia (UPM) in 2010. Currently, he is a full-time Assistant Professor in the Faculty of Computer Engineering of Islamic Azad University of Najafabad (IAUN). His research interests include data mining, web mining, social network mining, and recommender systems.

Mostafa Mosakhani was born in Iran. He received his master's degree in computer software engineering in 2013 from the Faculty of Computer Engineering, Islamic Azad University of Najafabad, under the supervision of Dr. Mohammad-Hossein Nadimi. He is currently continuing his research on name disambiguation in digital libraries using data mining techniques.
