International Conference on Advanced Technologies, Computer Engineering and Science (ICATCES’18), May 11-13, 2018 Safranbolu, Turkey
Application of PageRank Algorithm in Linked Data
Y. GÜLTEPE 1, K. AKYOL 1 and A. KARACI 1
1 Kastamonu University, Kastamonu/Turkey, [email protected]
1 Kastamonu University, Kastamonu/Turkey, [email protected]
1 Kastamonu University, Kastamonu/Turkey, [email protected]
Abstract - The main purpose of the semantic web is to develop standards and technologies that make well-defined, linked information and services on the web easy for computers to read and understand. Linked data is one of the approaches used to build meaningful, integrated data collections by creating semantic links between the resources that make up the content of the semantic web. Linked data is based on RDF (Resource Description Framework) technology. RDF is a data model that provides domain-independent formal semantics for describing resources as a graph. In a linked data application, the most important decision is how to access the linked data. A linked data crawler is a program that explores linked data on the web by following RDF links. In this work, the DBLP (Database Systems and Logic Programming) data set is used as a source of linked data. DBLP has gradually expanded to cover all fields of computer science. An example of PageRank ranking of RDF resources in the DBLP dataset is presented. As a result, the search space is reduced and the search results are improved.
Keywords - Linked Data, Semantic Web, PageRank, RDF, DBLP.
I. INTRODUCTION
The semantic web has become a reality today in the form of linked data [1]. Existing data is published on the web as triples that are both structured and machine-understandable because they comply with the Resource Description Framework (RDF, see footnote 1) standard. As these triples spread, datasets form in particular domains. These datasets become the basis of linked data once they are associated with each other at the semantic level through RDF links [2, 3]. Tim Berners-Lee first introduced the basic principles of the linked data concept in 2006 [2]. In 2007, a project called Linking Open Data (LOD) was launched with the aim of semantically interlinking openly licensed data sets [2]. Thanks to the LOD project, anyone can publish their data on the web by following the linked data guidelines. Because of this, the linked data space is constantly growing.
In the web search domain, text-based search engines rank documents and domains by their popularity and relevancy levels [3]. In the semantic web and linked data fields, however, ranking is more complex because the search process must take semantic relationships into account. Most ranking methods targeted at the semantic web and linked data are derived from common web search and ranking algorithms such as PageRank [3]. Linked data ranking methods are classified according to their data sources [4]. These classes are ontology ranking, graph ranking, entity ranking, and RDF document ranking. DBpediaRanker is a hybrid ranking algorithm that uses DBpedia resources. It explores the DBpedia graph to calculate a similarity value for each pair of resources reached during graph exploration, and it queries external information sources one by one [5].
In this work, the DBLP data set is used as a source of linked data. An example of PageRank ranking of RDF resources in the DBLP dataset is presented. DBLP contains bibliographic metadata for computer science, covering journal articles and more than 1.8 million publications in conference proceedings by more than one million authors. DBLP initially focused on two fundamental areas of computer science, database systems and logic programming (hence the abbreviation). It is, however, rapidly evolving to cover all disciplines of computer science.
The system proposed in [6] combines the advantages of two approaches, semantic-based and text-based information retrieval, and can perform advanced analysis by using the results from the most popular search engines. In addition, its ranking algorithms are based on textual and link analysis. Such a relative ranking system differs from PageRank-style algorithms: a node in the graph has no importance on its own but is ranked with respect to its neighboring nodes, and its value depends on the query being expressed. Instead of a single weight for each resource, as in PageRank-style algorithms, a weight representing the similarity between a pair of resources is calculated [6].

1 https://www.w3.org/RDF/
II. RANKING LINKED DATA
A. Ranking Methods
Linked data ranking approaches differ according to the data sources used. For example, an ontology-based ranking method can be applied to web services that use semantic web technologies. It relies on ontology-based matching of service advertisements and on the similarity between the service parameters in the domain. TF-IDF (see footnote 2) scoring can be used to rank RDF resources. A "language modeling approach" over RDF triples allows the resulting set of RDF results to be ranked. The "Maximal Marginal Relevance" approach is used to re-rank the top-ranked results by recalculating their relevance level. Another approach is entity ranking, which comes in two types: ranking of entities (such as persons, places and organisations) and entity type ranking (also called resource description ranking and property ranking).
B. PageRank Algorithm
PageRank is a value that indicates the popularity of a web page [1]. Search engines assess the quality of web pages, the number of visitors, and how often each page is visited, and assign a value to each web page separately. Based on a given algorithm, page ratings are calculated one by one for each web page. These algorithms measure the relationships of web pages to each other. Links between pages increase the likelihood that a visitor reaches the linked pages, and thereby raise the value of those pages. PageRank, the algorithm that formed the original core of Google's search algorithm, was developed in the 1990s by Larry Page and Sergey Brin [7].
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right) \tag{1}$$
The PageRank equation is shown in Equation (1), where T_1, ..., T_n are the pages that link to page A, C(T_i) is the number of outgoing links on page T_i, and d is the damping factor. PR(A) can be computed using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Note that PageRank rates each web page individually and does not rank web sites as a whole. The PageRank of page A is defined recursively by the PageRanks of the pages that link to page A. The PageRank of the pages T_i that link to page A does not influence the PageRank of page A uniformly. In the PageRank algorithm, the PageRank of a page T is always weighted by the number of outgoing links C(T) on page T. This means that the more outgoing links a page T has, the less page A benefits from a link to it on page T. The weighted PageRanks of the pages T_i are then added up. As a result, an additional incoming link for page A always raises page A's PageRank. Finally, the sum of the weighted PageRanks of all pages T_i is multiplied by the damping factor d, which can be set between 0 and 1. This reduces the extent to which a page benefits from another page linking to it.

2 http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf1-weighting-1.html
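The iterative computation of Equation (1) can be illustrated with a minimal Python sketch. The function name `pagerank`, the three-page graph, and the fixed iteration count are assumptions made for illustration and are not taken from the study; the sketch also assumes that every page has at least one outgoing link.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate Equation (1): PR(A) = (1 - d) + d * sum(PR(T_i) / C(T_i)).

    out_links maps each page to the list of pages it links to.
    Assumption of this sketch: every page has at least one outgoing link.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # start every page at 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T) / C(T) over all pages T that link to this page.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this form of the equation the scores are not normalized to 1; for a graph without dangling pages they sum to the number of pages.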
In study [8], the PageRank algorithm is modified to operate on the weighted matrix of the graph, as given in Equation (2), where P_x(a) is the value of node a in iteration x, d is the damping factor (for the web graph originally set to about 0.85), V is the set of all nodes in the graph, U is the set of nodes with a link to node a, D is the set of all dangling nodes, and w_ij is the weight of each link (i, j).
$$P_{x+1}(a) = \frac{1 - d}{|V|} + d \left( \sum_{u \in U} \frac{P_x(u) \, w_{ua}}{\sum_{v \in V} w_{uv}} + \frac{\sum_{s \in D} P_x(s)}{|V|} \right) \tag{2}$$
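One possible reading of Equation (2) is sketched below in Python. The function name `weighted_pagerank`, the edge weights, and the node names are hypothetical; the sketch assumes positive link weights and treats any node without outgoing weight as a dangling node whose score is shared equally among all nodes.

```python
def weighted_pagerank(weights, d=0.85, iterations=50):
    """Iterate Equation (2) on a weighted graph.

    weights maps a link (i, j) to its weight w_ij.
    Assumption of this sketch: link weights are positive.
    """
    nodes = set()
    for i, j in weights:
        nodes.update((i, j))
    n = len(nodes)

    # Total outgoing weight of each node: the inner sum over v in V of w_uv.
    out_weight = {node: 0.0 for node in nodes}
    for (i, _), w in weights.items():
        out_weight[i] += w
    dangling = [node for node in nodes if out_weight[node] == 0.0]

    p = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by dangling nodes, shared equally: sum over D of P_x(s) / |V|.
        dangling_share = sum(p[s] for s in dangling) / n
        new_p = {node: (1 - d) / n + d * dangling_share for node in nodes}
        for (u, a), w in weights.items():
            if w > 0:
                new_p[a] += d * p[u] * w / out_weight[u]
        p = new_p
    return p


# Hypothetical weighted graph; node "c" has no outgoing links (dangling).
edges = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): 1.0}
scores = weighted_pagerank(edges)
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

Because the dangling term returns the score of nodes without outgoing links to the whole graph, the scores keep summing to 1 in every iteration.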
In Equation (2), the PageRank value of a page is the sum of the contributions from the pages that link to it. A page's ranking value is distributed among the pages it links to (equally, when all link weights are the same). Moreover, besides the canonical parameters "damping factor" and "number of iterations", PageRank [9] calculations naturally depend most on the input graph. Thalhammer and Rettinger [10] show that filtering and weighting of relationships have a significant effect on PageRank calculations. For example, they state that PageRank scores computed on the RDF version derived from Wikipedia may correlate less with page-view-based rankings than PageRank scores computed on the Wikipedia link graph.
III. APPLICATION OF PAGERANK ALGORITHM IN LINKED DATA
The DBLP Computer Science Bibliography covers thousands of journals and conferences, and more than 1.8 million academic publications written by over one million authors. DBLP initially targeted database systems and logic programming and has grown to include all subfields of computing. In this study, the DBLP file with the .rdf extension is used, because RDF files can be consumed by various applications. The format is tag-based and uses attributes as well as object definitions. Figure 1 shows an excerpt of the DBLP data set ontology (dblp-2018-04-01.rdf) viewed in the Protégé (see footnote 3) development editor.
Figure 1: Data records in DBLP.

3 http://protege.stanford.edu/

Any web crawler needs a URL database to find all the URLs on the web. When the crawler performs a crawl operation for the PageRank computation, it must create a link queue. Although this queue is simple, building it is a significant step for large data sets. The steps followed in this study are given below. The PageRank algorithm, briefly described in Section II, is used to calculate PageRank scores on RDF graphs. Two input options can be used with the PageRank algorithm. The first is a common RDF graph serialization format such as N-Triples or Turtle. A program that runs the PageRank algorithm was developed using the Eclipse IDE. The output file containing the PageRank results consists of a total of 23412131 lines. In Figure 2, a simple N-Triples document is taken as input. An N-Triples document is a sequence of RDF triples, each consisting of RDF terms that represent the subject, predicate, and object.
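How links can be extracted from N-Triples input before ranking is sketched below in Python. This is only an illustration under simplifying assumptions, not the software developed in the study: the regular expression handles only triples whose subject is an IRI, the `example.org` IRIs are hypothetical, and only triples whose object is also an IRI are kept as graph edges, since literals cannot receive a rank.

```python
import re
from collections import Counter

# Simplified pattern: <subject> <predicate> object .
# Assumption of this sketch: blank-node subjects are not handled.
TRIPLE = re.compile(r'^\s*<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')


def iri_edges(lines):
    """Return (subject, predicate, object) for triples whose object is an IRI."""
    edges = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip lines this simplified pattern does not cover
        subject, predicate, obj = match.groups()
        if obj.startswith("<") and obj.endswith(">"):
            edges.append((subject, predicate, obj[1:-1]))
    return edges


# Hypothetical N-Triples input.
sample = [
    '<http://example.org/paper1> <http://example.org/cites> <http://example.org/paper2> .',
    '<http://example.org/paper1> <http://example.org/title> "A title" .',
    '<http://example.org/paper2> <http://example.org/cites> <http://example.org/paper3> .',
    '# a comment line',
]
edges = iri_edges(sample)

# Number of outgoing links per subject, i.e. C(T) in Equation (1).
out_degree = Counter(subject for subject, _, _ in edges)
print(edges)
print(out_degree)
```

For a file as large as the DBLP dump, a streaming RDF parser would be a more robust choice than a regular expression; the pattern here only keeps the example self-contained.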
Table 1 (final rows): Withdrawn Items 14; Total 4136275
As stated in Table 1, the DBLP data set is very large. For this reason, PageRank values are reported in Table 2 only for specific nodes of the DBLP data set.
Table 2: PageRank results for some nodes in the DBLP dataset.
Node
<http://dblp.13s.de/d2r/resource/journals/twc>
<http://dblp.13s.de/d2r/resource/publications/conf/lrec/2010>
<http://dblp.13s.de/d2r/resource/publications/phd/us/Kim2008>
<http://dblp.13s.de/d2r/resource/authors/Akemi_Izumiyama>
PageRank 0,53 15,59
Highest ranking
2,83 0,1
Lowest ranking
0,154 0,17
Figure 2: A simple RDF graph.

Figure 3 shows an example of PageRank propagation. Page A, with a rank value of 1,48, assigns a PageRank of 0,74 to each of pages B and C. Similarly, page B assigns a PageRank of 0,78 to C.

Figure 3: PageRank calculation on a simple graph.

Table 1 presents the total number of records available in the DBLP dataset in 2018, according to the most recently updated data.

Table 1: Records in DBLP (see footnote 4).
Title of Records                  Number
Journal Articles                  1669085
Book and Thesis                   79461
Data and Artifacts                66
Editorship                        37927
Parts in Books or Collections     36272
Informal_Publications             146719
Conference and Workshop Papers    2156740
Reference Works                   9991

4 https://dblp.uni-trier.de/statistics/recordsindblp

IV. CONCLUSION
In this study, the PageRank algorithm is applied to DBLP, one of the semantically interlinked data sets in the linked data cloud. Ranking can be applied to almost every layer of the semantic web. At the lowest levels, triples or XML-related indexes are indexed and ranked; RDF descriptions and document files are ranked; ontology definitions and properties are ranked; and the results are ranked at a higher level. A linked data crawler is a program that searches for linked data on the web by following RDF links. In this study, the DBLP (Database Systems and Logic Programming) data set is used as the linked data source. DBLP has gradually expanded to all areas of computer science. An example of PageRank ranking of RDF resources in the DBLP dataset is presented. As a result, the search space is reduced and the search results are improved.

REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web", Scientific American, Vol. 284, No. 5, pp. 34-43, 2001.
[2] C. Bizer, T. Heath, and T. Berners-Lee, "Linked data - the story so far", International Journal on Semantic Web and Information Systems, Vol. 5, No. 3, pp. 1-22, 2009.
[3] S. Brin and L. Page, "Anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, Vol. 30, No. 1, pp. 107-117, 1998.
[4] S. Yumusak, E. Dogdu, and H. Kodaz, "A short survey of linked data ranking", ACM SE '14, 2014.
[5] R. Mirizzi, A. Ragone, T. D. Noia, and E. D. Sciascio, "Ranking the Linked Data: The Case of DBpedia", ICWE 2010, pp. 337-354, 2010.
[6] R. Mirizzi, A. Ragone, T. D. Noia, and E. D. Sciascio, "Semantic tag cloud generation via DBpedia", International Conference on Electronic Commerce and Web Technologies, pp. 36-48, 2010.
[7] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford InfoLab, 1998.
[8] M. Nykl, K. Jezek, M. Dostal, and D. Fiala, "Linked data and PageRank-based Classification", IADIS International Conference Theory and Practice in Modern Computing 2013, pp. 61-64.
[9] S. Brin and L. Page, "The Anatomy of a Large-scale Hypertextual Web Search Engine", Proceedings of the Seventh International Conference on World Wide Web 7, pp. 107-117, 1998.
[10] A. Thalhammer and A. Rettinger, "PageRank on Wikipedia: Towards General Importance Scores for Entities", ESWC 2016.