New Review of Hypermedia and Multimedia
ISSN: 1361-4568 (Print) 1740-7842 (Online) Journal homepage: http://www.tandfonline.com/loi/tham20
Finding the top influential bloggers based on productivity and popularity features
Hikmat Ullah Khan & Ali Daud
To cite this article: Hikmat Ullah Khan & Ali Daud (2016): Finding the top influential bloggers based on productivity and popularity features, New Review of Hypermedia and Multimedia
To link to this article: http://dx.doi.org/10.1080/13614568.2016.1236151
Published online: 04 Oct 2016.
Finding the top influential bloggers based on productivity and popularity features
Hikmat Ullah Khan (a, b) and Ali Daud (b)
(a) Department of Computer Science, COMSATS Institute of Information Technology – Wah Campus, Wah Cantt, Pakistan; (b) Department of Computer Science and Software Engineering, International Islamic University, Islamabad, Pakistan
ABSTRACT
A blog acts as a platform for virtual communication where users share comments or views about products, events and social issues. Like other social web activities, blogging actions spread to a large number of people. Users influence others in many ways, such as in buying a product, holding a particular political or social opinion or initiating a new activity. Finding the top influential bloggers is an active research domain as it helps in various fields, such as online marketing, e-commerce, product search and e-advertisements. Various models exist to find the influential bloggers, but they consider a limited set of features and use a non-modular approach. This paper proposes a new model, the Popularity and Productivity Model (PPM), based on a modular approach to find the top influential bloggers. It consists of popularity and productivity modules which exploit various features. We discuss the role of each proposed and existing feature and evaluate the proposed model against the standard baseline models using datasets from real-world blogs. The analysis using standard performance evaluation measures verifies that both the productivity and popularity modules play a vital role in finding influential bloggers in a blogging community effectively.

ARTICLE HISTORY
Received 26 June 2015
Accepted 9 September 2016

KEYWORDS
Social web; blogger; influence; model; user characteristics; feature
CONTACT Hikmat Ullah Khan
© 2016 Informa UK Limited, trading as Taylor & Francis Group

1. Introduction

The social web offers participatory features that let its users share their views, provide information and discuss various topics, and it enables participation and interaction at a global level. Virtual communities of web forums, blogs and wikis create social networks. A web log, or blog for short, is an important form of social networking where users share views and opinions on topics such as products, services, and social and political issues. On a blog, a user starts a new topic by initiating a new post, which consists of text, multimedia content and hyperlinks to web pages or other posts. Users access blog content, share their comments and give feedback. The set of blogs on the web is known as the blogosphere. This facility for social interaction motivates researchers to study social patterns, social concepts and marketing aspects in the social web. A large number of people consult family members, colleagues, friends or others before buying something, visiting somewhere or watching a movie. Users whose recommendations are sought have been termed the influential ones (Keller & Berry, 2003). In life, we seek out influential persons to help us in various decisions. Owing to the dynamic online features of the social web, finding the influential bloggers in blogs is an important research problem. On technical blogs, a user may look for quality content, such as a quality comment or answer from a domain expert. On marketing blogs, a user seeks a few trustworthy customers and checks their reviews or feedback about a product.

The recent rise of the social web motivates researchers to study the social issues and topics discussed in blogs (Agarwal & Liu, 2008). Existing research finds influential websites and blog sites (Gill, 2004) and identifies sets of key persons in virtual communities (Gliwa, Koźlak, Zygmunt, & Cetnarowicz, 2012; Zygmunt, Brodka, Kazienko, & Kozlak, 2011, 2012). PageRank (Page, Brin, Motwani, & Wingard, 1998) and link-based information retrieval methods are used for ranking in scholarly networks (Ding, 2011). However, it is argued (Agarwal, Liu, Tang, & Yu, 2008, 2012) that the blogosphere forms a sparse network and thus link-based approaches are inappropriate for research in the blogosphere. Certain factors in the form of features help to find the expert in a specific domain from a real-world community (Naeem, Bilal, & Tanvir, 2013). The iIndex model (Aggarwal, Lin, & Yu, 2012) finds influential bloggers using salient features. The related work section reviews existing research models in the relevant literature.
We propose a model, named the Popularity and Productivity Model (PPM), that introduces new features, uses important existing features from the relevant literature (Aggarwal et al., 2012; Akritidis, Bozanis, & Katsaros, 2009; Akritidis, Katsaros, & Bozanis, 2011) and follows a novel modular approach. The use of the performance evaluation measures of Spearman correlation and KSim (Haveliwala, 2002) is also novel. Our contributions also include new features capturing bloggers’ posting behaviour, their ability to stay and post consistently, and the average length of the comments. The modular approach comprises two modules, productivity and popularity. We analyse the results of both modules feature by feature, which not only shows the importance of each feature but also provides a foundation for the evaluation. In addition, the proposed modules and model are evaluated against existing methods using standard ranking performance evaluation measures. The rest of the paper is organised as follows: Section 2 reviews the methods in the relevant literature, Section 3 formulates the problem and Section 4 describes the PPM, its modules and framework. Section 5 elaborates the experimental set-up, dataset and performance evaluation measures. Before concluding the paper, in Section 6, a comparative analysis using another dataset is provided.
2. Related work

The existing research on finding influential bloggers in the blogging community considers either bloggers’ features or the network structure of the blogs. We discuss the models of these two techniques separately; the proposed model is feature-centric.
2.1. Feature-based models

Research on blogs has recently become an active domain owing to the versatility of topics and the huge population of users who actively participate in blog activities. One blog study finds the central users, who may spread information to a large number of people (Jure, Mary, Christos, Natalie, & Matthew, 2007). The concept of finding influential bloggers is introduced in the influence flow model, known as iIndex (Agarwal et al., 2008). It assumes that active users are also influential. Another model, iFinder (Agarwal et al., 2012), considers the features of Recognition (count of inlinks and comments received by a blogger’s posts), Activity Generation (count of posts initiated by a blogger), Novelty (outlinks) and Eloquence (length of comments). The model uses a small number of features and finds influential bloggers who are also active. It compares the model with ranking algorithms and argues that link-based ranking algorithms are inappropriate for finding influential bloggers. This approach fails to take into account bloggers’ salient features such as activeness and consistency. The iFinder model, in terms of the notation used in this paper, is summarised as follows:

$$\mathrm{iFinder}(b) = w_l \ast N_p^b + \left(N_c + (N_i - N_o)\right), \tag{1}$$
where $w_l$ is the length of blog posts, which acts as a weight; $N_p^b$ is the number of posts initiated by the blogger $b$; and $N_c$, $N_i$ and $N_o$ are the numbers of comments, inlinks and outlinks, respectively, received on those blog posts.

The MEIBI and MEIBIX metrics identify the top influential bloggers (Akritidis et al., 2009). The metrics consider a few features and explore the temporal aspect of the blogger’s activity. Two further metrics, BP-index and BI-index (Akritidis et al., 2011), are based on MEIBI and MEIBIX. BI-index measures a blogger’s temporal influence index, while BP-index measures a blogger’s temporal productivity index. The h-index (Bui, Nguyen, & Ha, 2014) is a measure of the influence of a scholar in a scholarly network. Using the h-index approach, top influential bloggers are identified (Bui et al., 2014). The h-index is considered unsuitable for blogger networks and has limitations (Alonso, Caberizo, Herrera-Viedma, & Herrera, 2009): it does not include all the inlinks, as it considers only h-core values. Owing to these limitations, more advanced indexes, such as g-index, A-index and R-index, have been proposed (Alonso et al., 2009). A recent work (Moh & Shola, 2013) proposes that, in addition to a blogger’s role in a blogging network, his/her role should also be compared within other social web communities, such as Facebook and Google+. It thus proposes the factors FacebookCount and Uniqueness. Akritidas and Bozanis (2014) propose a blog ranking model built on temporal and quality features. The model considers the bloggers’ factors in finding the top blogs in the blogosphere.

2.2. Network-based models

Social network analysis is an active research domain concerned with analysing large social networks. It offers various measures to examine social network characteristics. The proposed centrality-based measures find the important nodes in a network (Brodka, Musial, &
Kazienko, 2009). Another research work reviews the centrality measures in social networks (Landherr, Friedl, & Heidemann, 2010). It assesses the quality of centrality measures and illustrates their significance. It compares closeness centrality, betweenness centrality, degree centrality and eigenvector centrality. All such measures are applied to identify important and influential nodes in a social network. Tursov et al. (2010) present the use of social network measures to find users who influence others. Using social network data, the authors find that about 20% of the users influence other users in the network. Research focus then shifted to predicting the influential users in social networks by ranking the nodes in networks (Ghosh & Lerman, 2010). The predictive ability of various influence models has also been compared using the Digg dataset. Digg is a blog that shares the top blog posts from the blogosphere. The results show that non-conservative models are capable of prediction, and that the alpha-centrality metric is the best indicator of influence. Aral and Walker (2012) present a novel idea to find influential and susceptible members of a social network. Brodka (2012) proposes three algorithms to find key users in a social network using telecommunication data. The aim is to find users who influence other members of the network and also play a role in network evolution. A novel aspect is the study of the temporal effect on the ranking of the key users. Finding the top influential nodes in a social network by exploiting graph structure is a significant research area in social network analysis (Cha, Haddadi, Benevenuto, & Gummadi, 2010; Goyal, Bonchi, & Lakshmanan, 2008; Weng, Lim, Jiang, & He, 2010). The main application of finding influential users is in online marketing and electronic commerce. It has also been found useful in online web services (Weng et al., 2010).
Hayashi, Akiba, and Yoshida (2015) proposed using shortest paths to maintain the edge betweenness value dynamically. The dynamics they consider are a stream of edge insertions and deletions, which is unsuitable for influence computation. The dynamics of an influence network are more complex because, besides edge insertions and deletions, the influence probabilities of edges may evolve over time (Lei, Maniu, Mo, Cheng, & Senellart, 2015). Aggarwal et al. (2012) proposed a model to find the nodes which have the highest influence within a given time span. The model’s influence propagation is nonlinear and differs from conventional linear models (Kempe, Kleinberg, & Tardos, 2003).
3. Problem formulation and problem statement

A blogger initiates a new thread and other users can share their comments. A blog post is considered influential if it attracts the attention of a number of bloggers, motivating them to create a link or post a comment. A blogger who initiates a large number of such influential blog posts is considered an influential blogger. We aim to identify influential bloggers based on certain features of productivity and popularity. The blog topics and the semantics of the content are out of the scope of the paper. Given a set $B$ of $N$ bloggers, $\{b_1, b_2, \ldots, b_N\}$, the problem of finding the influential bloggers can formally be defined as determining an ordered subset $I$ of $K$ bloggers, $\{b_{j_1}, b_{j_2}, \ldots, b_{j_K}\}$, ordered according to their influence scores $S_{infl}$, such that $K \le N$ and $S_{infl}(b_{j_1}) \ge S_{infl}(b_{j_2}) \ge \cdots \ge S_{infl}(b_{j_K})$, where the set $I$ contains the $K$ most influential bloggers.
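The formulation above amounts to a top-K selection over influence scores. The following is a minimal sketch, assuming the scores have already been computed and are held in a dictionary; the function name `top_k_influential`, the blogger identifiers and the score values are hypothetical and chosen only for illustration. Ties are broken alphabetically here, which is an assumption the problem statement does not make.

```python
from typing import Dict, List


def top_k_influential(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k bloggers with the highest influence scores, best first."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must satisfy 0 <= k <= N")
    # Sort by descending score; ties are broken by name (an assumption)
    # so that the ordering is deterministic.
    ranked = sorted(scores, key=lambda b: (-scores[b], b))
    return ranked[:k]


# Hypothetical influence scores, for illustration only.
example_scores = {"b1": 0.42, "b2": 0.91, "b3": 0.17, "b4": 0.91}
print(top_k_influential(example_scores, 3))
```

The returned list plays the role of the ordered subset $I$; the constraint $K \le N$ is enforced by the range check.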
4. Proposed model – PPM

The proposed model, PPM, considers the factors of popularity and productivity. The features considered for each module are discussed within that module. The list of symbols used is given in Table 1.

Table 1. List of symbols used in the paper.

| Symbol | Remarks |
| --- | --- |
| $B$ | Set of bloggers |
| $P$ | Set of blog posts |
| $b$ | $b \in B$ |
| $p$ | $p \in P$ |
| $N_p^b$ | Number of blog posts posted by a blogger |
| $N_d^b$ | Number of days blog posts posted by a blogger |
| $S_r^b$ | Score of regular posting of a blogger |
| $N_l^b$ | Length of blog posts posted by a blogger |
| $S_a^b$ | Score of average length of the blog posts posted by a blogger |
| $N_c^b$ | Number of comments received on blog posts posted by a blogger |
| $N_I^b$ | Number of inlinks received on blog posts posted by a blogger |
| $N_o^b$ | Number of outlinks in blog posts posted by a blogger |
| $S_{prod}^b$ | Computed score of blogger $b$ based on the productivity features |
| $S_{popu}^b$ | Computed score of blogger $b$ based on the popularity features |
| $S_{infl}^b$ | Final influence score of blogger $b$ based on all the features |

4.1. Popularity module

Certain factors are considered as sources of influence in the blogosphere. Popularity refers to a blogger’s impact within the blogging community. The numbers of comments and inlinks received measure a blog’s significance. It has been argued (Akritidis et al., 2011) that a comment can be positive or negative. Here, the number of comments is treated as a positive factor only, as the content is not analysed. Inlinks, on the other hand, show direct influence and are considered a source of authority in linking algorithms such as PageRank. Outlinks are inversely related to novelty, as in the baseline, and are subtracted from the recognition part. The popularity features are as follows:

Recognition (f4): The count of comments received by the blog posts of a blogger depicts his/her recognition within the social network. It is denoted as $N_c^b$.

Authority (f5): Ranking algorithms (Page et al., 1998) consider inlinks as a direct measure of authority and give more importance to inlinks than to comments (Akritidis et al., 2009). The authority is denoted as $N_I^b$.

Novelty (f6): The existence of outlinks shows a lack of novelty in the content. It is represented by $N_o^b$. The baseline model, iFinder, treats it as an inverse measure and we take the same approach. For the single-feature results, we consider bloggers who have a number of posts (activity) but share few outlinks, because ranking on a small number of outlinks alone returns bloggers who have initiated few or no posts.

Table 2. TUAW dataset statistics.

| Characteristics | Value |
| --- | --- |
| Bloggers | 51 |
| Posts | 17,831 |
| Inlinks | 53,575 |
| Comments | 267,949 |
| Weblogs | 6,655 |
| Blog post per blogger | 350 |
| In-links per blog post | 3 |
| Comments per blog post | 15 |
| Average length per blog post | 118 |

The existing models (Akritidis et al., 2009; Haveliwala, 2002; Jure et al., 2007) give more importance to inlinks than to any other feature. In addition, the statistics given in Table 2 validate the importance of inlinks over comments on blog posts. The popularity score is calculated using Equation (2), which is given as follows:

$$S_{popu}^b = w_c N_c^b + (w_I N_I^b - w_o N_o^b), \tag{2}$$
where $w_c$, $w_I$ and $w_o$ represent the weights of comments, inlinks and outlinks, respectively.

4.2. Productivity module

A blogger who initiates blog posts consistently and regularly is considered productive. The productivity score is based on the following features:

Activity (f1): A blogger’s capability to initiate blog posts is a significant factor as it depicts his/her main contribution, and the other productivity factors are based on it. Other users can also be influenced by a blogger’s activity. It is represented by $N_p^b$.

Activeness (f2): A blogger’s active behaviour is important, as discussed in related work. Earlier work discusses whether an active blogger is also influential. We posit that an active blogger is more influential than a non-active blogger. A blogger who creates many posts in a short period but remains inactive for a long period may not be considered influential. Activeness is calculated as the number of days a blogger remains active in a blog. It is denoted by $N_d^b$.

Consistency (f3): A blogger should be consistent in his/her posting behaviour to be considered influential. Consistency measures how regularly a blogger posts and is denoted by $S_r^b$. It is calculated by dividing the number of posts by the posting period, which is the difference between the last and first posting dates divided by 30 to give a month-wise value.

PostLength (f7): Post length is a measure of the eloquence of the content and, indirectly, of the blogger. The feature, denoted by $N_l^b$, represents the total number of characters in the posts of a blogger $b$.

NormalisedPostLength (f8): A blogger may sometimes post very lengthy content, which gives him/her a high post length score; thus, we introduce a normalised post length. The feature, denoted as $S_a^b$, is calculated by dividing the cumulative post length of all the posts of a blogger $b$ by the count of all his/her posts.
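The five productivity features can be derived from a blogger’s list of posts. The sketch below is illustrative only: the `Post` record and the function `productivity_features` are hypothetical names, Activeness is assumed to mean the number of distinct calendar days with at least one post, and the posting period is clamped to a minimum of one month (an assumption added here to avoid division by zero when all posts fall on the same day).

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Post:
    posted_on: date  # posting date
    text: str        # post body


def productivity_features(posts: List[Post]) -> Dict[str, float]:
    """Compute the productivity features for one blogger's posts."""
    if not posts:
        raise ValueError("at least one post is required")
    n_posts = len(posts)                                  # Activity
    # Activeness: assumed to be the count of distinct posting days.
    active_days = len({p.posted_on for p in posts})
    first = min(p.posted_on for p in posts)
    last = max(p.posted_on for p in posts)
    # Posting period in months (30-day units); clamped to 1 by assumption.
    months = max((last - first).days / 30, 1.0)
    consistency = n_posts / months                        # Consistency
    total_length = sum(len(p.text) for p in posts)        # PostLength
    average_length = total_length / n_posts               # NormalisedPostLength
    return {
        "activity": float(n_posts),
        "activeness": float(active_days),
        "consistency": consistency,
        "post_length": float(total_length),
        "normalised_post_length": average_length,
    }


# Hypothetical posts, for illustration only.
demo_posts = [
    Post(date(2016, 1, 1), "abcd"),
    Post(date(2016, 1, 1), "ab"),
    Post(date(2016, 3, 1), "abcdef"),
]
print(productivity_features(demo_posts))
```

In this toy input, the first and last posts are 60 days apart, so the posting period is two months and three posts give a consistency of 1.5 posts per month.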
The productivity score is computed using Equation (3):

$$S_{prod}^b = w_p N_p^b + (w_d N_d^b + w_r S_r^b) + (w_{len} N_l^b + w_{nl} S_a^b), \tag{3}$$
where $w_p$ is the weight of blogger activity, $w_d$ and $w_r$ are the weights of activeness and consistency, respectively, and $w_{len}$ and $w_{nl}$ are the weights of PostLength and NormalisedPostLength, respectively.

4.3. Influence score

Finally, a blogger’s influence score, $S_{infl}^b$, is measured by taking the weighted sum of both modules, using Equation (4), as follows:

$$S_{infl}^b = w_{prod} S_{prod}^b + w_{popu} S_{popu}^b. \tag{4}$$
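Equations (2) to (4) can be read as three small functions. The sketch below is one possible rendering, not the authors’ implementation: the class and function names are hypothetical, the default weights are the values reported in Section 4.4 and are used here purely as illustrative defaults, and the features are combined at their raw scale exactly as the equations are written (any normalisation would be an additional assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Defaults follow the values reported in Section 4.4 (illustrative).
    w_c: float = 0.6     # comments
    w_I: float = 0.9     # inlinks
    w_o: float = 0.2     # outlinks
    w_p: float = 0.8     # activity
    w_d: float = 0.3     # activeness
    w_r: float = 0.3     # consistency
    w_len: float = 0.2   # PostLength
    w_nl: float = 0.2    # NormalisedPostLength
    w_prod: float = 0.3  # productivity module
    w_popu: float = 0.7  # popularity module


@dataclass(frozen=True)
class Features:
    n_posts: float      # N_p
    active_days: float  # N_d
    consistency: float  # S_r
    post_length: float  # N_l
    avg_length: float   # S_a
    comments: float     # N_c
    inlinks: float      # N_I
    outlinks: float     # N_o


def popularity_score(f: Features, w: Weights) -> float:
    """Equation (2): comments plus inlinks, minus outlinks."""
    return w.w_c * f.comments + (w.w_I * f.inlinks - w.w_o * f.outlinks)


def productivity_score(f: Features, w: Weights) -> float:
    """Equation (3): activity, regularity and length terms."""
    return (w.w_p * f.n_posts
            + (w.w_d * f.active_days + w.w_r * f.consistency)
            + (w.w_len * f.post_length + w.w_nl * f.avg_length))


def influence_score(f: Features, w: Weights) -> float:
    """Equation (4): weighted sum of the two module scores."""
    return (w.w_prod * productivity_score(f, w)
            + w.w_popu * popularity_score(f, w))


# Hypothetical feature values for one blogger, for illustration only.
demo = Features(n_posts=10, active_days=5, consistency=2.0,
                post_length=1000, avg_length=100,
                comments=50, inlinks=20, outlinks=4)
print(influence_score(demo, Weights()))
```

Keeping the weights in one immutable record makes it straightforward to rerun the scoring with a different weight setting.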
4.4. Use and effects of weights

The feature weights regulate the significance of each feature. The weight adjustment approach is based on the iFinder method: the weight of the feature under observation is varied while the remaining weights are kept constant, and the value at which the ranking becomes stable is taken. In the popularity module, using Equation (2), we fix two weights, vary the third from 0 to 1 and observe how the ranking changes. With the weights of inlinks $w_I$ and outlinks $w_o$ fixed, the ranking stabilises for a comments weight $w_c \ge 0.6$. When the inlinks and outlinks weights are varied, the model stabilises for inlinks weight $w_I \ge 0.9$ and outlinks weight $w_o \ge 0.2$. This suggests the significance of inlinks, which are also given higher weights in the existing works (Agarwal et al., 2008, 2011, 2012). The statistics in Table 2 also confirm that inlinks are more important than comments. A similar approach gives the weights applied in the productivity module: the activity weight $w_p$ is observed to be 0.8, $w_d$ and $w_r$ are found to be 0.3, and the weights related to post length are 0.2. This validates our assumption that activity is the most important characteristic for measuring a blogger’s productivity, while the other features depend on activity. For a blogger’s influence score, the module weights for productivity $w_{prod}$ and popularity $w_{popu}$ are found to be 0.3 and 0.7, respectively. The pseudocode of the algorithm for feature computation shows that it can be implemented using approaches similar to those of existing models and that it has similar complexity: the features are first computed using database queries and the model is then computed from the module scores. The framework of the proposed model is shown in Figure 1.

ALGORITHM: Finding top influential bloggers
Input: blog data
Output: top k influential bloggers

FOR EACH b ∈ B
    Initialise N_p^b, N_c^b, N_I^b, N_o^b, N_l^b to 0
    FOR EACH p ∈ P posted by b
        N_p^b = N_p^b + 1
        N_c^b = N_c^b + CalculateComments(p)
        N_I^b = N_I^b + CalculateInlinks(p)
        N_o^b = N_o^b + CalculateOutlinks(p)
        N_l^b = N_l^b + CountChars(p)
    END FOR
    N_d^b = CalculateActiveDays(b)
    S_r^b = N_p^b / ((max(postdate) − min(postdate)) / 30)
    S_a^b = N_l^b / N_p^b
END FOR
▹ Computation of module scores
FOR EACH b ∈ B
    S_prod^b = w_p N_p^b + (w_d N_d^b + w_r S_r^b) + (w_len N_l^b + w_nl S_a^b)
    S_popu^b = w_c N_c^b + (w_I N_I^b − w_o N_o^b)
    ▹ Computation of blogger’s influence score
    S_infl^b = w_prod S_prod^b + w_popu S_popu^b
END FOR
RETURN the k bloggers with the highest S_infl^b
▹ END of algorithm
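The weight adjustment procedure of Section 4.4 (fix two weights, sweep the third from 0 to 1, and take the value from which the ranking no longer changes) could be sketched as follows. This is one interpretation of "becomes stable", not a procedure confirmed by the source: the helper names `ranking` and `stable_weight`, the 0.1 step size and the three toy bloggers are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


def ranking(scores: Dict[str, float]) -> List[str]:
    """Bloggers ordered by descending score (ties broken by name)."""
    return sorted(scores, key=lambda b: (-scores[b], b))


def stable_weight(score_with: Callable[[float], Dict[str, float]],
                  grid: Sequence[float]) -> float:
    """Smallest grid value from which the ranking stays unchanged."""
    rankings = [ranking(score_with(w)) for w in grid]
    stable_from = len(grid) - 1
    # Walk backwards while the ranking equals the one at the largest weight.
    while stable_from > 0 and rankings[stable_from - 1] == rankings[-1]:
        stable_from -= 1
    return grid[stable_from]


# Hypothetical (comments, inlinks, outlinks) counts for three bloggers.
toy = {"a": (10, 5, 1), "b": (2, 12, 0), "c": (7, 7, 3)}


def popularity_with_wc(w_c: float) -> Dict[str, float]:
    # Equation (2) with w_I = 0.9 and w_o = 0.2 held fixed.
    return {name: w_c * c + (0.9 * i - 0.2 * o)
            for name, (c, i, o) in toy.items()}


grid = [step / 10 for step in range(11)]  # 0.0, 0.1, ..., 1.0
print(stable_weight(popularity_with_wc, grid))
```

On real data the sweep would be repeated for each weight in turn, with the other weights held at their current values.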
5. Experimental set-up

This section introduces the TUAW dataset and the performance evaluation measures applied to compare the proposed model with existing models.

5.1. TUAW dataset

The Unofficial Apple Weblog (TUAW), a blog dedicated to Apple, lets users discuss various Apple products and services. Users are free to share comments on the blog posts and to create links to other web pages or blog posts. In this research, the TUAW dataset¹ is used, which is widely used in the relevant literature (Agarwal et al., 2008, 2009, 2011). Table 2 presents the dataset characteristics.
Figure 1. Proposed framework for PPM.
5.2. Performance evaluation measures

Agarwal, Mahata, and Liu (2014) argue that, as influence is a subjective phenomenon and lacks ground truth, only certain features can be used for the evaluation and comparison of methods. Owing to the modular approach, the following performance evaluation measures have been used.

5.2.1. Spearman’s rank-order correlation

Spearman’s rank-order correlation calculates the correlation coefficient between the orders of two parameters. Here, it analyses the correlation between the results of two modules. The equation of Spearman’s rank-order correlation is given as follows:

$$\rho = 1 - \frac{6 \sum_{i=1}^{k} d_i^2}{k(k^2 - 1)}, \tag{5}$$

where $d_i$ is the difference between the two ranks of the $i$-th blogger and $k$ is the number of ranked bloggers. In our case, we compare the results of both modules taking the value of $k$ as 10, 20 and 30.

5.2.2. KSim

KSim (Haveliwala, 2002) measures the strength of dependence between two variables. It considers how much variation lies between the two results. For two rankings $\tau_1$ and $\tau_2$, it is calculated using the following formula:

$$\mathrm{KSim}(\tau_1, \tau_2) = \frac{\left|\{(u, v) : \tau_1', \tau_2' \text{ agree on order of } (u, v),\ u \neq v\}\right|}{|U|\,(|U| - 1)}, \tag{6}$$

where $u$ and $v$ are distinct bloggers in $U$, the union of the bloggers in $\tau_1$ and $\tau_2$. The extended ranking $\tau_1'$ contains the bloggers of $\tau_1$ followed by those in $U - \tau_1$; $\tau_2'$ is obtained analogously from $\tau_2$.
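Both measures are short to compute. The sketch below is illustrative and rests on stated assumptions: `spearman_rho` expects two rankings over the same set of bloggers with no tied ranks, and `ksim` treats the bloggers appended to an extended ranking as tied in last place, so a pair that is ordered in one list but tied in the other counts as a disagreement. The function names and the toy rankings are hypothetical.

```python
from itertools import permutations
from typing import Dict, List


def spearman_rho(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Equation (5) for two rankings of the same k items (no ties)."""
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both rankings must cover the same items")
    k = len(rank_a)
    if k < 2:
        raise ValueError("at least two items are required")
    d_squared = sum((rank_a[x] - rank_b[x]) ** 2 for x in rank_a)
    return 1 - 6 * d_squared / (k * (k * k - 1))


def ksim(tau1: List[str], tau2: List[str]) -> float:
    """Equation (6): share of ordered pairs on which both lists agree."""
    universe = set(tau1) | set(tau2)
    if len(universe) < 2:
        raise ValueError("at least two distinct items are required")

    def extended_positions(tau: List[str]) -> Dict[str, int]:
        pos = {x: i for i, x in enumerate(tau)}
        # Items missing from tau are appended after it; they are assumed
        # to share the same (last) position.
        return {x: pos.get(x, len(tau)) for x in universe}

    p1, p2 = extended_positions(tau1), extended_positions(tau2)
    agree = 0
    for u, v in permutations(universe, 2):
        d1, d2 = p1[u] - p1[v], p2[u] - p2[v]
        # Agreement requires a strict order in both lists, in the same direction.
        if d1 != 0 and d2 != 0 and (d1 > 0) == (d2 > 0):
            agree += 1
    return agree / (len(universe) * (len(universe) - 1))


# Hypothetical rankings, for illustration only.
print(spearman_rho({"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 1}))
print(ksim(["a", "b"], ["a", "c"]))
```

When the two top-k lists contain different bloggers, `ksim` can be applied directly, whereas `spearman_rho` first needs the lists restricted or extended to a common set of bloggers.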
6. Results and discussion

The results are evaluated and discussed in three phases. First, the top 10 bloggers are found using a single feature, so that the results can be analysed and compared in the absence of ground truth. Secondly, PPM is compared with existing models. Lastly, the results are discussed with the help of the performance evaluation measures.

6.1. Feature-wise results evaluation

The feature-wise analysis provides a base for the overall evaluation and its results are given in Table 3. S. McNulty is ranked in the first position in three features: Activity, Activeness and comments. E. Sadun holds a position among the top five ranks in almost all the features. S. McNulty and E. Sadun are therefore projected as the top candidates to be considered influential bloggers. The case of D. Caolo and D. Chartier is notable as both are ranked in the top five positions, yet neither is ranked in the top position in any feature. D. Caolo holds relatively higher ranks in more features than D. Chartier and is expected to have a higher rank than D. Chartier. C. Bohon is ranked as the top blogger in inlinks, but lacks any notable position among the top five rankings of any other feature. Now, let us look at the variations in the ranking of the bloggers. S. McNulty is ranked higher when compared to C.K. Sample III, who has more ranking variations. D. Caolo and D. Chartier both hold similar overall ranks, but differ in the case of inlinks, which is an important feature.

Table 3. List of the top bloggers based on each feature.

| Rank | F1 Recognition | F2 Authority | F3 Novelty | F4 Activity | F5 Activeness | F6 Consistency | F7 Length | F8 Avg length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Scott McNulty | Cory Bohon | Brad Hill | Scott McNulty | Scott McNulty | Barb Dybwad | Erica Sadun | Weblogs, Inc. |
| 2 | Erica Sadun | Erica Sadun | C.K. Sample, III | Dave Caolo | Dave Caolo | David Chartier | David Chartier | Chris Ullrich |
| 3 | Dave Caolo | Robert Palmer | Michael Sciannamea | David Chartier | David Chartier | Sean Bonner | Scott McNulty | Pariah S. Burke |
| 4 | David Chartier | Dave Caolo | Greg Scher | Erica Sadun | Erica Sadun | C.K. Sample, III | Dave Caolo | Jason Clarke |
| 5 | Victor Agreda, Jr. | Mike Schramm | Dori Smith | C.K. Sample, III | Michael Rose | Erica Sadun | Mat Lu | Christina Warren |
| 6 | Mat Lu | Michael Rose | David Touve | Mat Lu | Mat Lu | Scott McNulty | Michael Rose | Brett Terpstra |
| 7 | Cory Bohon | Mat Lu | Marc Orchant | Laurie A. Duncan | Cory Bohon | Dave Caolo | C.K. Sample, III | Scott Granneman |
| 8 | Michael Rose | Steven Sande | Damien Barrett | Cory Bohon | Laurie A. Duncan | Robert Palmer | Laurie A. Duncan | Joshua Ellis |
| 9 | Mike Schramm | Scott McNulty | Jan Kabili | Michael Rose | Mike Schramm | Mat Lu | Cory Bohon | Caryn Coleman |
| 10 | Robert Palmer | Brett Terpstra | Robert Palmer | Mike Schramm | C.K. Sample, III | Mike Schramm | Mike Schramm | Mat Lu |

6.2. Comparative analysis of PPM with iFinder model

Table 4 presents the top influential bloggers as ranked by PPM and existing models. As iFinder (Aggarwal et al., 2012) is an extension of iIndex (Agarwal et al., 2008), iFinder is taken as the baseline model. S. McNulty holds top ranks in both the proposed modules; however, iFinder does not rank him even in the top 10 positions. The baseline model ranks C. Bohon as the top blogger, while the feature-wise analysis shows that he is ranked seventh in activeness and eighth in activity, and does not appear in the top five positions except for inlinks. This shows that iFinder gives too much significance to inlinks, whereas PPM gives importance to all the features. It can be argued that an inlink may support a blog or oppose it, as a form of feedback to that blog, so it should not be given too much importance. Neither the productivity nor the popularity module gives C. Bohon the top ranks. E. Sadun takes the second rank in the proposed method, as expected, and the popularity module also gives her a top rank, but she is not ranked among the top 10 by the baseline method. D. Caolo and D. Chartier are ranked third and fourth, respectively, by PPM as anticipated, but the baseline gives them lower ranks. C. Bohon, the top blogger by iFinder, is ranked fifth by PPM and holds similar positions in the feature-wise analysis.

6.3. Comparative analysis of PPM with MEIBI and MEIBIX metrics

Table 4 presents a comparison of PPM with the existing metrics MEIBI and MEIBIX. We discuss the cases of the top three bloggers. Both metrics rank C. Bohon first based on his high number of inlinks. R. Palmer takes the second position in both metrics, but he is third in inlinks and does not hold any high rank in other features.
The case of Steven Sande provides a better comparison as he is ranked third by both metrics, but he does not hold a ranking in any feature except the eighth position in the inlink feature. Therefore, our approach does not rank him in the top 10 positions. The results using the performance evaluation measures suggest that the overall top 10 bloggers are usually common to the modules and the existing metrics, but the variation in ranks is notable. The results in Table 5 present a higher correlation for PPM than for either module separately, and MEIBIX is more correlated than MEIBI. In addition, the modules present a large number of variations in the rank order for each blogger, as evident from the KSim results. The proposed method has different results from iFinder, as Pearson and KSim are less than 0.1 for all the values of k; therefore, only feature-wise results have been given. The main difference is that the metrics give too much importance to inlinks while the proposed modules give importance to all the features.

Table 4. Top 10 bloggers based on productivity, popularity, baseline and combined model.

| Rank | Productivity | Popularity | PPM | iFinder | MEIBI | MEIBIX |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Scott McNulty | Scott McNulty | Scott McNulty | Cory Bohon | Cory Bohon | Cory Bohon |
| 2 | Dave Caolo | Erica Sadun | Erica Sadun | Robert Palmer | Robert Palmer | Robert Palmer |
| 3 | David Chartier | Dave Caolo | Dave Caolo | Mat Lu | Steven Sande | Steven Sande |
| 4 | Erica Sadun | Cory Bohon | David Chartier | Christina Warren | Erica Sadun | Erica Sadun |
| 5 | C.K. Sample, III | David Chartier | Cory Bohon | Dave Caolo | Michael Rose | Christina Warren |
| 6 | Mat Lu | Victor Agreda, Jr. | Victor Agreda, Jr. | Chris Ullrich | Mike Schramm | Michael Rose |
| 7 | Laurie A. Duncan | Mat Lu | Mat Lu | Steven Sande | Christina Warren | Mat Lu |
| 8 | Cory Bohon | Michael Rose | Michael Rose | Michael Rose | Dave Caolo | Dave Caolo |
| 9 | Michael Rose | Mike Schramm | Mike Schramm | Victor Agreda, Jr. | Mat Lu | Brett Terpstra |
| 10 | Mike Schramm | Robert Palmer | Robert Palmer | Jason Clarke | Brett Terpstra | Mike Schramm |

Table 5. KSim results for comparison of proposed vs. existing approaches.

| Comparison between | Pearson rank-order, Top 30 | Pearson rank-order, Top 20 | Pearson rank-order, Top 10 | KSim, Top 30 | KSim, Top 20 | KSim, Top 10 |
| --- | --- | --- | --- | --- | --- | --- |
| Productivity vs. MEIBI | 0.26 | 0.23 | 0.23 | 0.16 | 0.15 | 0.19 |
| Productivity vs. MEIBIX | 0.26 | 0.25 | 0.26 | 0.19 | 0.17 | 0.2 |
| Popularity vs. MEIBI | 0.36 | 0.26 | 0.27 | 0.18 | 0.19 | 0.21 |
| Popularity vs. MEIBIX | 0.36 | 0.28 | 0.29 | 0.6 | 0.26 | 0.27 |
| PPM vs. MEIBI | 0.48 | 0.36 | 0.34 | 0.8 | 0.31 | 0.29 |
| PPM vs. MEIBIX | 0.48 | 0.37 | 0.39 | 0.8 | 0.36 | 0.35 |
6.4. Comparative analysis of productivity and popularity

Figure 2 presents the comparison of the productivity and popularity modules. Examination of the results reveals that productivity and popularity features are consistent. The difference between PPM and the productivity module shows that merely initiating a number of posts is not a proper measure of influence. Popularity results are similar to those of PPM, reflecting its role in finding influence. An analysis of Figure 2 validates that both productivity and popularity are important, as both contribute effectively to identifying the top influential bloggers. The middle line displays the ranks of PPM, with which the results of productivity and popularity are compared. The resultant ranks of popularity are closer to PPM, while the results obtained by productivity are ranked relatively below the PPM trend-line.

Figure 2. Comparative analysis of productivity and popularity.
6.5. Evaluation using performance evaluation measures
One of our main contributions is the evaluation of the results of both modules using performance evaluation measures, which has not been applied in earlier work except by Akritidis et al. (2011), in which the Pearson correlation between results is computed. Table 5 presents the comparative analysis considering the top k (i.e. 10, 20, 30) bloggers and the entire dataset. The Pearson rank-order correlation gives the correlation coefficient between the results of productivity, popularity and PPM. Popularity is more correlated to PPM than productivity for all values of k, as it evaluates a blogger’s influence. KSim measures the variations between two lists, or between the results of two algorithms. The KSim results for the top k bloggers (10, 20, 30 and the entire dataset) are given in Table 6. KSim shows a small number of variations between the productivity and popularity modules. Notably, this measure gives results similar to those of the Pearson rank-order correlation. The results of productivity and popularity have similar values for the top 30 bloggers, which signifies that both productivity- and popularity-based features are important and contribute to the identification of the top influential bloggers in a blogging community.
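As a hedged illustration of the two measures discussed in this section, the sketch below computes a rank-order correlation (the Pearson coefficient applied to rank positions) and a KSim score for two ranked lists. The KSim variant follows the definition in Haveliwala (2002); that this is the exact variant used for the reported results is an assumption, since the formula is not restated here. The function names and the toy lists are hypothetical and are not taken from the datasets.

```python
from itertools import combinations
from math import sqrt


def rank_order_correlation(ranking_a, ranking_b):
    """Pearson correlation of rank positions for two orderings of the same items."""
    if len(set(ranking_a)) != len(ranking_a) or len(ranking_a) < 2:
        raise ValueError("a ranking needs at least two distinct items")
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(ranking_b):
        raise ValueError("both rankings must order the same items")
    pos_a = {item: rank for rank, item in enumerate(ranking_a, start=1)}
    pos_b = {item: rank for rank, item in enumerate(ranking_b, start=1)}
    xs = [pos_a[item] for item in ranking_a]
    ys = [pos_b[item] for item in ranking_a]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ksim(list_a, list_b):
    """KSim in the sense of Haveliwala (2002), assumed here.

    Each list is extended with the items it lacks, all tied at the end.
    The score is the fraction of item pairs ordered the same way in both
    extended lists; a pair tied in one list counts as a disagreement.
    """
    union = set(list_a) | set(list_b)

    def extended_positions(ranking):
        pos = {item: i for i, item in enumerate(ranking)}
        tail = len(ranking)  # shared position for all missing items
        return {item: pos.get(item, tail) for item in union}

    pos_a = extended_positions(list_a)
    pos_b = extended_positions(list_b)
    pairs = list(combinations(sorted(union), 2))
    if not pairs:
        raise ValueError("need at least two distinct items overall")
    agree = sum(
        1 for u, v in pairs
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) > 0
    )
    return agree / len(pairs)


# Hypothetical toy rankings, not taken from the paper's data.
print(rank_order_correlation(["b1", "b2", "b3", "b4"], ["b2", "b1", "b3", "b4"]))
print(ksim(["b1", "b2"], ["b1", "b3"]))
```

The rank-order function requires both lists to contain the same items, whereas KSim also accepts top-k lists that only partly overlap, which is why the two measures can be reported side by side for top-k comparisons.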
Table 6. Pearson rank-order correlation results.

                                     Pearson Rank-Order Results      KSim Results
Comparison between                   Top 30    Top 20    Top 10      Top 30    Top 20    Top 10
Productivity vs. Popularity score    0.86      0.83      0.83        0.62      0.62      0.64
Productivity vs. PPM                 0.87      0.85      0.86        0.65      0.63      0.68
Popularity vs. PPM                   0.99      0.99      0.99        0.97      0.98      0.95

7. Analysis of PPM using Engadget dataset
In addition to the TUAW dataset, PPM is evaluated using the Engadget dataset. To avoid confusion in the discussion of bloggers, the analysis of each dataset is presented separately.

7.1. Engadget dataset
The Engadget blog (Note 2) covers electronic gadgets and shares technology news, and common users can discuss technology topics there. Engadget is one of the top blogs: it was awarded best Tech Blog for two years by the Bloggie, and Time magazine ranked it among the top five blogs in the world in 2010 (Note 3). According to Wikipedia (Note 4), in 2013 Engadget was ranked among the top five blogs in the Blogosphere by Technorati, a world-renowned publisher advertising platform. The Engadget dataset is available free of cost (Note 5) and is widely used in recent relevant research work (Akritidis et al., 2009, 2011). The characteristics of the dataset are given in Table 7.
Table 7. Engadget dataset characteristics.

Characteristics                   Engadget
Bloggers                          93
Blog posts                        63,358
In-links                          319,880
Comments                          3,672,819
Blog post per blogger             681
In-links per blog post            5
Comments per blog post            58
Average length per blog post      180
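The per-blogger and per-post rows of Table 7 appear to be simple ratios of the totals in the same table. The short sketch below recomputes them under that assumption; rounding to the nearest integer is also an assumption, and the variable names are illustrative only.

```python
# Totals copied from Table 7 (Engadget dataset).
bloggers = 93
blog_posts = 63_358
in_links = 319_880
comments = 3_672_819

# Assumed derivation: each average is a plain ratio of two totals.
posts_per_blogger = blog_posts / bloggers
in_links_per_post = in_links / blog_posts
comments_per_post = comments / blog_posts

for label, value in [
    ("Blog posts per blogger", posts_per_blogger),
    ("In-links per blog post", in_links_per_post),
    ("Comments per blog post", comments_per_post),
]:
    print(f"{label}: {value:.2f} (nearest integer: {round(value)})")
```

Under these assumptions the rounded ratios match the three derived rows of Table 7. The average post length cannot be checked this way, because the total text length is not listed.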
7.2. Comparison of PPM using Engadget with baseline metrics
PPM is compared with the recent metrics of BI-index and BP-index. Table 8 shows that the top rank of blogger Murph D. is consistent across the baselines, the modules and PPM, which verifies that PPM and its modules find the top blogger. The case of Ziegler C., however, is noteworthy. Ziegler’s rank is high according to the baseline methods, but the proposed methods do not rank him highly. In the single feature-wise analysis, he is ranked eighth in activity, eighth in inlinks and ninth in comments, the latter two being the important features for measuring a blogger’s influence. The comparison of Miller P. and Ricker T. is also notable: in the feature-wise analysis both rank in the top five positions, but Miller P. holds relatively better positions for the features of inlinks, activity and comments. The baselines rank Ricker T. higher than Miller P., while the productivity and blog rank modules as well as PPM rank Miller P. higher than Ricker T., which is the proper ordering. The blogger analysis reveals that the proposed metric PPM and its modules identify the top bloggers effectively.

Table 8. A comparison of PPM, modules versus baselines.

Rank   BI-Index       BP-Index       Productivity   Popularity     PPM
1      Murph D.       Murph D.       Murph D.       June L.        Murph D.
2      Ziegler C.     Ziegler C.     Block R.       Murph D.       Miller P.
3      Savov V.       Savov V.       Miller P.      Ricker T.      Ricker T.
4      Miller P.      Ricker T.      Rojas P.       Miller P.      Block R.
5      Flatley J.     Topolsky J.    Melanson D.    Patel N.       June L.
6      Stevens T.     Miller P.      Ricker T.      Topolsky J.    Melanson D.
7      Stern J.       Stevens T.     Patel N.       Block R.       Patel N.
8      Ricker T.      Flatley J.     Topolsky J.    Melanson D.    Topolsky J.
9      Topolsky J.    Miller R.      Ziegler C.     Ziegler C.     Ziegler C.
10     Miller R.      Stern J.       Blass E.       Miller R.      Rojas P.

Table 9. A comparison of PPM and baseline metrics.

Comparison             OSim    Spearman’s Correlation    Kendall’s Correlation
Activity vs. PPM       0.90    0.80                      0.68
Recognition vs. PPM    0.90    0.43                      0.38
BlogRank vs. PPM       0.90    0.87                      0.71
BP-Index vs. PPM       0.60    0.55                      0.37
BI Index vs. PPM       0.50    0.45                      0.32

Table 10. Techcrunch dataset characteristics.

Characteristics                   Techcrunch
Bloggers                          107
Blog posts                        19,464
In-links                          193,808
Comments                          746,561
Blog post per blogger             181
In-links per blog post            10
Comments per blog post            38
Average length per blog post      169

Table 9 presents the comparative analysis of the proposed method, its modules and the baselines using the performance evaluation measures. OSim measures the overlapping similarity of the two orders. The high values of overlapping similarity suggest that the overall identification of bloggers is similar. The rank-order correlation gives the correlation between the two orders, whereas the Kendall correlation captures the alterations in ordering between the two rankings. The correlation results are relatively different, which suggests that the order of the ranks produced by the methods varies. This is understandable, as the baselines use three main features (activity, in-links and comments) along with a temporal factor, while the proposed metric uses as many as nine different features. The overall similarity of the modules is high when compared to the baselines.
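The overlap measure can be illustrated with the top-10 lists of Table 8. The sketch below assumes that OSim is the number of bloggers shared by the two top-k lists divided by k, as in Haveliwala (2002), and that the first and last name columns of Table 8 belong to BI-Index and PPM respectively. The function name `osim` is hypothetical, and the sketch is an illustration, not a reproduction of the authors' computation.

```python
def osim(ranking_a, ranking_b, k):
    """Share of the top-k entries that two rankings have in common."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Top-10 lists as read from Table 8 (the column assignment is an assumption).
bi_index = ["Murph D.", "Ziegler C.", "Savov V.", "Miller P.", "Flatley J.",
            "Stevens T.", "Stern J.", "Ricker T.", "Topolsky J.", "Miller R."]
ppm = ["Murph D.", "Miller P.", "Ricker T.", "Block R.", "June L.",
       "Melanson D.", "Patel N.", "Topolsky J.", "Ziegler C.", "Rojas P."]

shared = sorted(set(bi_index) & set(ppm))
print(shared)
print(osim(bi_index, ppm, k=10))
```

For these two lists, five of the ten bloggers are shared, which agrees with the OSim value listed for BI Index vs. PPM in Table 9. OSim ignores the order inside the lists, which is why the section reports it together with the two correlation measures.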
7.3. Techcrunch dataset
Techcrunch (Note 6) is a world-renowned web log for sharing news and articles related to information-technology companies and business organisations. According to Wikipedia (Note 7), about two million users access the blog every month, which indicates the volume of data on the blog. The Techcrunch dataset, which contains features of bloggers, is available free of cost (Note 8) and has been used in recent relevant papers (Akritidis et al., 2011). Table 10 presents the characteristics of the dataset.
7.4. Comparison of PPM using Techcrunch data with existing metrics
Table 11 presents the top 10 bloggers ranked by the h-index, PPM and its modules. A careful analysis reveals the top bloggers in the Techcrunch blog. The overall count of common bloggers is high, considering both the proposed and the baseline methods. The Kendall similarity measure captures the alterations in ordering between the compared lists; higher values show that the agreement in ordering between the baseline and the compared results is high, and the correlation between the results is high as well. In a nutshell, PPM is evaluated using the three widely used datasets of TUAW, Engadget and Techcrunch, and the results are compared with existing methods in the relevant literature. The evaluation shows that both popularity and productivity are important to measure a blogger’s impact within the blog community.

Table 11. Blogger’s ranking based on PPM, modules and h-index.

Rank   H-index        Productivity module   Popularity module   PPM
1      Michael A.     Michael A.            Michael A.          Michael A.
2      Erick S.       Erick S.              MG Siegler          Erick S.
3      MG Siegler     Jason K.              Erick S.            Jason K.
4      Duncan R.      MG Siegler            Jason K.            Robin W.
5      Jason K.       Robin W.              Robin W.            Leena R.
6      Mark H.        Leena R.              John B.             MG Siegler
7      Robin W.       Duncan R.             Guest A.            Guest A.
8      Leena R.       Mark H.               Leena R.            Duncan R.
9      Marshall K.    Guest A.              Mark H.             John B.
10     Guest A.       John B.               Sarah L.            Mark H.
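To make the idea of alterations in ordering concrete, the sketch below computes Kendall's tau for the h-index and PPM columns of Table 11. Restricting the computation to the bloggers that appear in both lists is a simplifying assumption made for this illustration; how the reported values treat non-overlapping entries is not stated in this section. The function name and the column assignment are likewise assumptions.

```python
from itertools import combinations


def kendall_tau_on_common(ranking_a, ranking_b):
    """Kendall's tau over the items that appear in both rankings."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    common = [item for item in ranking_a if item in pos_b]
    concordant = discordant = 0
    # combinations() keeps the order of ranking_a, so u precedes v there.
    for u, v in combinations(common, 2):
        if pos_b[u] < pos_b[v]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    if pairs == 0:
        raise ValueError("need at least two common items")
    return (concordant - discordant) / pairs


# Top-10 lists as read from Table 11 (the column assignment is an assumption).
h_index = ["Michael A.", "Erick S.", "MG Siegler", "Duncan R.", "Jason K.",
           "Mark H.", "Robin W.", "Leena R.", "Marshall K.", "Guest A."]
ppm = ["Michael A.", "Erick S.", "Jason K.", "Robin W.", "Leena R.",
       "MG Siegler", "Guest A.", "Duncan R.", "John B.", "Mark H."]

common = [blogger for blogger in h_index if blogger in ppm]
print(len(common), "bloggers appear in both lists")
print(kendall_tau_on_common(h_index, ppm))
```

A value of 1 would mean that the shared bloggers appear in the same relative order in both lists, and a value of -1 that the order is fully reversed.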
8. Conclusion
In this paper, we propose that productivity and popularity are essential perspectives for finding the top influential users. The proposed features relate to bloggers’ popularity and productivity characteristics. The single feature-wise analysis sets up the comparative analysis of the productivity module, the popularity module and the proposed model, PPM. The model is compared against the baseline model and existing metrics using real-world blog data, and the results validate that PPM identifies the influential bloggers effectively. Performance evaluation measures are used for the comparative analysis. In the future, we intend to further enhance the model by introducing more modules and by evaluating sentiment features to check whether influential bloggers are also those who share positive content. In addition, we intend to apply similar concepts and features to new datasets, especially for finding quality content in online forums where people share their comments in various languages and discuss various topics. As further future work, we intend to find the influential bloggers using association rule mining (Erlandsson, Brodka, Brog, & Johnson, 2016).
Notes
1. TUAW Dataset: http://users.sch.gr/lakritid/code.php?c=2. Accessed April 30, 2016.
2. http://www.engadget.com/. Accessed April 8, 2016.
3. https://en.wikipedia.org/wiki/Engadget. Accessed April 9, 2016.
4. https://en.wikipedia.org/wiki/Engadget. Accessed April 15, 2016.
5. http://users.sch.gr/lakritid/code.php?c=3. Accessed April 10, 2016.
6. http://techcrunch.com/. Accessed April 8, 2016.
7. https://en.wikipedia.org/wiki/TechCrunch. Accessed April 9, 2016. Accessed April 08, 2016.
8. http://users.sch.gr/lakritid/code.php?c=4. Accessed April 8, 2016.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
Agarwal, N., & Liu, H. (2008). Blogosphere: Research issues, tools, and applications. ACM SIGKDD Explorations Newsletter, 10(1), 18–31. doi:10.1145/1412734.1412737
Agarwal, N., Liu, H., Tang, L., & Yu, S. (2008). Identifying the influential bloggers in community. International conference on web search and data mining. Palo Alto, CA.
Agarwal, N., Liu, H., Tang, L., & Yu, P. (2012). Modeling blogger influence in a community. Social Network Analysis and Mining, 2(2), 139–162. doi:10.1007/s13278-011-0039-3
Agarwal, N., Mahata, D., & Liu, H. (2014). Time and event driven modeling of blogger influence. In Encyclopedia of social network analysis and mining (ESNAM) (pp. 2154–2165). New York, NY: Springer.
Aggarwal, C., Lin, S., & Yu, P. (2012). On influential node discovery in dynamic social networks. SIAM International conference on data mining, California, USA.
Akritidis, L., & Bozanis, P. (2014). Improving opinionated blog retrieval effectiveness with quality measures and temporal features. World Wide Web, 17(4), 777–798. doi:10.1007/s11280-013-0237-1
Akritidis, L., Bozanis, P., & Katsaros, D. (2009). Identifying influential bloggers: Time does matter. Proceeding of Web Intelligence and Intelligent Agent Technologies (WI-IAT) (Vol. 1, pp. 76–83). Milan, Italy.
Akritidis, L., Katsaros, D., & Bozanis, P. (2011). Identifying the productive and influential bloggers in a community. IEEE Transactions on Systems Man and Cybernetics Part C (Applications and Reviews), 41(5), 759–764. doi:10.1109/TSMCC.2010.2099216
Alonso, S., Cabrerizo, F., Herrera-Viedma, E., & Herrera, F. (2009). H-index: A review focused in its invariant, computations and standardization for different scientific fields. Journal of Informetrics, 3, 273–289. doi:10.1016/j.joi.2009.04.001
Aral, S., & Walker, D. (2012). Identifying influential and susceptible members of social networks. Science, 337(6092), 337–341. doi:10.1126/science.1215842
Brodka, P. (2012). Key users in social network. How to find them? Saarbrücken, Germany: LAP Lambert Academic Publishing.
Brodka, P., Musial, K., & Kazienko, P. (2009). A performance of centrality calculation in social networks. Proceedings of International Conference on Computational Aspects of Social Networks, Fontainebleau, France (pp. 24–31).
Bui, D.-L., Nguyen, T.-T., & Ha, Q.-T. (2014). Measuring the influence of bloggers in their community based on the H-index family. Paper presented at the 2nd International Conference on Computer Science, Applied Mathematics and Applications (ICCSAMA) (Vol. 282, pp. 313–324). Budapest, Hungary.
Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. (2010). Measuring user influence in twitter: The million follower fallacy. International Conference on Web and Social Media (ICWSM) (pp. 10–17). Washington, DC, USA.
Ding, Y. (2011). Applying weighted pagerank to author citation networks. Journal of the American Society for Science and Technology, 6(2), 236–245. doi:10.1016/j.ipm.2010.01.002
Erlandsson, F., Brodka, P., Brog, A., & Johnson, H. (2016). Finding influential users in social media using association rule learning. Entropy, 18(5), 164. doi:10.3390/e18050164
Ghosh, R., & Lerman, K. (2010). Predicting Influential users in online social networks. Paper presented at the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA.
Gill, K. (2004). How can we measure the influence of the Blogosphere? Proceedings of the WWW Workshop on the Weblogging ecosystem: Aggregation, analysis and dynamics. New York, USA.
Gliwa, B., Koźlak, J., Zygmunt, A., & Cetnarowicz, K. (2012). Models of social groups in blogosphere based on information about comment addressees and sentiments. Proceedings of the 4th international conference on social informatics. Lausanne, Switzerland.
Goyal, A., Bonchi, F., & V. S. Lakshmanan, L. (2008). Discovering leaders from community actions. Proceeding of 17th Conference on Information and Knowledge Management (CIKM), Napa Valley, CA (pp. 499–508).
Haveliwala, T. (2002). Topic-sensitive pagerank. Paper presented at the 11th International conference on World Wide Web (pp. 517–526). New York: ACM. doi:10.1145/511446.511513
Hayashi, T., Akiba, T., & Yoshida, Y. (2015). Fully dynamic betweenness centrality maintenance on massive networks. VLDB Endowment, 9(2), 48–59.
Jure, L., Mary, M., Christos, F., Natalie, G., & Matthew, H. (2007). Cascading behavior in large blog graphs. SIAM International conference on data mining (pp. 202–209). Minneapolis, MN, USA.
Keller, E., & Berry, J. (2003). The influentials: One American in ten tells the other nine how to vote, where to eat, and what to buy. New York, USA: The Free Press.
Kempe, D., Kleinberg, J., & Tardos, É. (2003). Maximizing the spread of influence through a social network. Paper presented at the 9th ACM SIGKDD international conference on knowledge discovery and data mining. Washington, DC, USA.
Landherr, A., Friedl, B., & Heidemann, J. (2010). A critical review of centrality measures in social networks. Business & Information Systems Engineering, 2(6), 371–385. doi:10.1007/s12599-010-0127-3
Lei, S., Maniu, S., Mo, L., Cheng, R., & Senellart, P. (2015). Online influence maximization. Paper presented at the 21st ACM SIGKDD international conference on Knowledge discovery and data mining. Paris, France.
Moh, T.-S., & Shola, S. (2013). New Factors for Identifying Influential Bloggers. IEEE International conference on big data. Silicon Valley, CA, USA.
Naeem, M., Bilal, M., & Tanvir, M. (2013). Expert discovery: A web mining approach. Journal of AI and Data Mining, 1(1), 35–47. Retrieved from http://jad.shahroodut.ac.ir/article_116_a4340c56079eae6e7078c17de68a6460.pdf
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project. Retrieved from http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Trusov, M., Bodapati, A., & Bucklin, R. (2010). Determining influential users in internet social networks. Journal of Marketing Research, 47(4), 643–658. doi:10.1509/jmkr.47.4.643
Weng, J., Lim, E., Jiang, J., & He, Q. (2010). Twitterrank: Finding topic-sensitive influential twitterers. Third ACM international conference on Web search and data mining. New York, USA.
Zygmunt, A., Brodka, P., Kazienko, P., & Kozlak, J. (2011). Different approaches to group and key person identification in Blogosphere. International conference on advances in social analysis and mining. Kaohsiung, Taiwan.
Zygmunt, A., Brodka, P., Kazienko, P., & Kozlak, J. (2012). Key person analysis in social communities within the blogosphere. Journal of Universal Computer Science, 18(4), 577–597. doi:10.3217/jucs-018-04-0577