Identifying Implicit Relationships between Social Media Users to Support Social Commerce

Christopher C. Yang
College of Information Science and Technology, Drexel University
[email protected]

Xuning Tang
College of Information Science and Technology, Drexel University
[email protected]

Haodong Yang
College of Information Science and Technology, Drexel University
[email protected]

Ling Jiang
College of Information Science and Technology, Drexel University
[email protected]
ABSTRACT
The Internet is an ideal platform for business-to-consumer (B2C) and business-to-business (B2B) electronic commerce, where businesses and consumers conduct commerce activities such as searching for consumer products, promoting business, managing supply chains and making electronic transactions. With the advance of Web 2.0 technologies and the popularity of social media sites, social commerce offers new opportunities for social interaction between electronic commerce consumers as well as between consumers and e-retailers. User-contributed content provides a tremendous amount of information that may assist electronic commerce services. Social network analysis and mining has been a powerful tool for electronic commerce vendors and marketing companies to understand user behavior, which is useful for identifying potential customers of their products. However, the capability of social network analysis and mining diminishes when the social network data is incomplete, especially when only limited ties are available. The social networks extracted from explicit relationships in social media are usually sparse: many social media users who have similar interests may not have direct interactions with one another or purchase the same products. Therefore, the explicit relationships between electronic commerce users are not sufficient to construct social networks for effective social network analysis and mining. In this work, we propose temporal analysis techniques to identify implicit relationships for enriching the social network structure. We have conducted an experiment on Digg.com, a social media site for users to discover and share content from anywhere on the Web. The experiment shows that the temporal analysis techniques outperform baseline techniques that rely only on explicit relationships.
Categories and Subject Descriptors H.2.8 [Database Management]: Database applications---Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia; I.5.4 [Pattern Recognition]: Applications
ICEC '12, August 07 - 08 2012, Singapore, Singapore Copyright 2012 ACM 978-1-4503-1197-7/12/08…$10.00.
General Terms Algorithms, Measurement, Experimentation, Human Factors
Keywords User Activeness, Explicit Relationship, Implicit Relationship, Social Media Analytics, Web 2.0, Temporal Analysis, Spectral Analysis, Spectral Coherence, Social Commerce, Social Network Analysis and Mining
1. INTRODUCTION
The Internet is an ideal platform for business-to-consumer (B2C) and business-to-business (B2B) electronic commerce, where businesses and consumers conduct commerce activities such as searching for consumer products, promoting business, managing supply chains and making electronic transactions. The success of electronic commerce retailers such as Amazon and eBay has proven that the Internet is an ideal platform for electronic commerce. In recent years, following the success of MySpace, Facebook and Twitter, social media has drawn significant attention in electronic commerce. It is important to identify how we may use social media to facilitate electronic commerce services and stimulate electronic commerce transactions [21][22][24]. Social commerce, considered part of electronic commerce, is drawing increasing interest from academia and industry in developing new theories and technologies to understand the behavior of social media users and to extract knowledge from user-contributed content and the social network structure. Social network analysis and mining is powerful in identifying influential users, discovering special-interest subgroups, determining user roles, and understanding community evolution [20][23][25][26]. These techniques help us understand the development of electronic commerce users' interests for matching products with potential consumers. However, social network analysis and mining is useful in achieving these goals only if we have a social network that captures the ties between actors as input [20][23]. In conducting social network analysis and mining on social media, we always face the challenge of capturing user relationships of common interest. Social media is a large space that involves a tremendous number of users and interactions. The explicit interactions between users may only represent part of the user relationships; many users who have common interests do not necessarily interact with each other explicitly.

User relationships can be classified into two types, namely explicit relationships and implicit relationships [19]. On one hand, an explicit relationship is a relationship that can be captured through direct user interaction via the functions of a social media site. For example, a Web forum user A makes a comment on a post made by another user B; a Twitter user A retweets a tweet posted by another Twitter user B; a Facebook user A accepts a “friend request” from another Facebook user B. These explicit interactions between two social media users indicate a relationship between them because of their common interest in a particular content/action or an intention of building a close relationship. On the other hand, an implicit relationship corresponds to a relationship drawn from users' coherent interests or activities that cannot be traced through any public record of interaction between the two users on a social media site. Extracting implicit relationships between users is relatively more challenging. Two users who have a common interest in certain content behave similarly in following the related events; however, these two users may not have any direct interaction with each other. For example, two users who are interested in a particular movie may post comments or tweets about their opinions on the movie when they see its trailers or news on the Web, yet never directly comment on each other's opinions. From the electronic commerce perspective, identifying the implicit relationships between users provides valuable information for marketing and for recommending consumer products to potential consumers, overcoming the sparseness of ties in social networks constructed from explicit relationships.
In this paper, we propose a temporal analysis approach to the problem of identifying implicit relationships between online social media users. An experiment is conducted to compare the performance of the proposed implicit relationship identification techniques with explicit relationship extraction. It shows that the proposed implicit relationship identification techniques obtain a higher F1 measure.
2. RELATED WORK
In business applications, social networks are widely used to represent customer relationships, buyer-seller relationships or buyer-supplier relationships. For instance, in a customer relationship management system, customers and their relationships can be represented by a social network where each node denotes a customer and each edge corresponds to a relationship between two customers. In most situations, the network structure, especially the links of the network, is declared by the users. For example, on online social media sites such as Facebook and Twitter, users can explicitly follow each other or accept a friendship request from another user. However, these explicit relationships are not always publicly available. As a result, there is a research need to acquire implicit relationships between users.

Link prediction is a related research topic of inferring implicit relationships among users. Liben-Nowell and Kleinberg defined the link prediction problem as inferring missing interactions among users which are highly probable to occur in the future [11]. They further categorized existing link prediction approaches into neighborhood-based and path-ensemble based approaches [11]. For the neighborhood-based approach, Newman computed the probability of collaboration between two users by counting the common neighbors of the two nodes in a collaboration network [16]; Adamic and Adar employed a score function that refines the simple counting of common neighbors by weighting rarer neighbors more heavily [1]. Different from the neighborhood-based approaches, the path-ensemble based approaches take into account the distance between two nodes. Katz proposed a formula that measures the probability of a future link by summing over the paths between two nodes, weighting shorter paths more heavily [10]. Liben-Nowell and Kleinberg proposed a normalized and symmetric version of hitting time to tackle the link prediction problem [11], where hitting time is defined as the expected time for a random walk starting at one node to reach another node; a shorter hitting time corresponds to a higher proximity between two users. Besides the neighborhood and path-ensemble based approaches, some recent works focused on training a classifier that takes proximity, aggregation and topological features into account to conduct link prediction [2][12]. However, most existing link prediction techniques need a social network as an initial input, and their objective is to predict future interactions given the current network snapshot. In our problem, we assume such an initial social network is unavailable because explicit connections between users are missing or incomplete. Instead, we utilize temporal analysis to identify the implicit relationships between social media users.

Some previous techniques discover implicit relationships based on user similarity. McPherson et al. introduced the notion of homophily and considered that social relationships are likely to form between people of similar characteristics [13]. Provost et al.
introduced methods for extracting a “quasi-social network” from data on visits to social networking pages [17]. A link is placed between two users in this “quasi-social network” depending on their visits to common web pages. Similarly, proximity can also be interpreted as interaction in physical spaces [4]. However, such explicit quasi-relationships are not always available on all social media sites. In this work, we propose to use only temporal data to extract implicit relationships, regardless of the users' specific activities such as visiting the same web pages or commenting on the same web objects. We propose to represent user activeness by a time series, namely the user feature vector, and investigate user behavior with spectral analysis techniques. The spectral coherence score between two user feature vectors is used to represent their potential relationship. Signal processing techniques have been investigated in a wide variety of fields including genetics [15], economics [7], and neuroscience [6][9]. Signal processing techniques were also applied to extract semantically related search engine queries [5] and to cluster words in news streams to detect popular events [8]. In this work, we employ signal processing techniques to analyze the frequency of users' social media activities and determine the implicit relationship between any two users, even if these two users do not have any explicit interaction or relationship.
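For illustration, the neighborhood-based link prediction scores reviewed above (common neighbors [16] and Adamic-Adar [1]) can be sketched as follows; the adjacency structure below is a toy example, not data from this paper:

```python
import math

def common_neighbors(adj, u, v):
    # Newman [16]: score = number of shared neighbors of u and v
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    # Adamic-Adar [1]: rarer shared neighbors weigh more (1 / log degree);
    # degree-1 neighbors are skipped to avoid division by log(1) = 0
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

# Toy undirected graph given as adjacency sets
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
}
```

Here users "a" and "b" never interact directly, yet both scores are positive because they share the neighbors "c" and "d".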
3. METHOD
3.1 Temporal Analysis
Direct user interactions may only be useful in extracting the explicit relationships and are not adequate for identifying the implicit relationships. Indeed, an implicit relationship cannot be determined by simply tracing the one-to-one interactions between users. By identifying and integrating implicit relationships, we are able to obtain a more comprehensive view and achieve better social media analytics performance. In this section, we propose a temporal analysis research framework to identify implicit relationships from online social media.
3.1.1 Temporal Coherence Analysis
Generally speaking, web content nowadays is generated mostly by web users. External events, such as the release of the new iPhone 4S or the anniversary sale of a famous brand, trigger a mass of web content contributed by users with a common interest. As a result, even though two users do not explicitly interact with each other, as long as they react similarly to common external events, these two users might share a common interest or have an implicit relationship. In other words, we assume that an implicit relationship exists between two users when their daily activeness has strong temporal coherence. The objective of temporal coherence analysis is to calculate the temporal similarity between every pair of users, resulting in a similarity matrix that represents the strength of the implicit relationships between users. To quantify the similarity between any pair of users, say u and v, we first represent them by user feature vectors, then compute the auto-spectrum of each individual vector and the cross-spectrum of u and v, and finally employ the spectral coherence of u and v to quantify their similarity.

User Feature Vector. In our work, users are represented by vectors. Given any online social media site, let T be the period (in days) during which we investigate user behavior and interaction. For each user u, we represent his/her activeness by a vector defined as:

a_u = ( a_u(1), a_u(2), ..., a_u(T) )    (1)

where each element a_u(t) represents the activeness of user u on the t-th day. a_u(t) can be defined in a very flexible manner according to the context. Several attributes can be used to quantify user activeness depending on which online social media site we are studying, for example, the number of messages a user posts daily, the number of URLs a user tags daily, the number of tweets a user posts or retweets daily, or the number of videos a user clicks on and comments on daily. In its simplest form, a_u(t) can be defined as:

a_u(t) = n_u(t) / N(t)    (2)

where n_u(t) is the number of messages posted by user u on day t and N(t) is the total number of messages posted by all users, including u, on day t. It is worth mentioning that a_u(t) can be defined in more delicate ways when more user information is given. In this paper, we focus our research on studying the implicit relationships among users in online social media (data from Digg.com will be used as our test bed in the experiment). Given all the user-contributed content and user actions, we define a_u(t) as:

a_u(t) = ( n_u(t) / N(t) ) × ( 1 − s_u / S )    (3)

where n_u(t) is the number of messages (stories or comments) contributed by user u on day t, N(t) is the total number of messages contributed by all users including u on day t, s_u is the total number of unique threads that user u participated in (submitted or made comments to) over the period T, and S is the total number of unique threads over T. Equation (3) consists of two components: n_u(t)/N(t) and (1 − s_u/S). The first component, n_u(t)/N(t), measures the number of messages that user u contributes on day t normalized by the total number of messages of day t. The higher n_u(t)/N(t) is, the more messages u contributes, leading to a larger activeness of u on day t. The second component, (1 − s_u/S), penalizes a user if s/he participated in too many different stories, which may imply that this user does not have a specific focus. In this study, we consider a group of m users to form an m-dimensional multivariate process. By employing the user feature vector defined above, the m-dimensional multivariate process can be denoted as A, where each row denotes a user and each column indicates the activeness of these users on a specific day:

A = ( a_1(1)  ...  a_1(T)
        ⋮      ⋱     ⋮
      a_m(1)  ...  a_m(T) )    (4)
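To make Equation (3) concrete, the activeness vectors can be computed from raw post records as in the following sketch; the `posts` tuple layout and the function name are illustrative choices, not part of the paper:

```python
from collections import defaultdict

def activeness_matrix(posts, T):
    """Build the activeness vectors of Equation (3).

    posts: list of (user, day, thread) tuples, with day in 1..T.
    Returns {user: [a_u(1), ..., a_u(T)]}, i.e. the rows of matrix A.
    """
    n = defaultdict(lambda: [0] * T)      # n_u(t): messages by u on day t
    N = [0] * T                           # N(t): messages by all users on day t
    threads_of = defaultdict(set)         # unique threads each user touched (s_u)
    all_threads = set()                   # all unique threads (S)
    for user, day, thread in posts:
        n[user][day - 1] += 1
        N[day - 1] += 1
        threads_of[user].add(thread)
        all_threads.add(thread)
    S = len(all_threads)
    A = {}
    for user, counts in n.items():
        focus = 1.0 - len(threads_of[user]) / S   # (1 - s_u / S) focus penalty
        A[user] = [focus * c / N[t] if N[t] else 0.0
                   for t, c in enumerate(counts)]
    return A
```

For example, a user who posts one of two messages on day 1 in a single thread out of two threads gets a_u(1) = (1/2) × (1 − 1/2) = 0.25.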
3.1.2 Spectral Analysis
User behaviors in terms of multivariate time series are often rich in oscillatory content, which lends them naturally to spectral analysis. By computing the spectrum of a user, we can quantify the overall activeness of that user in online social media. To calculate spectral estimates of users (auto- or cross-spectrum and coherence), we first perform a Fourier transform on each user feature vector. Given a finite-length user feature vector of the discrete time process a_u(t), t = 1, 2, ..., T, the Fourier transform of the data sequence is defined as follows:

\tilde{a}_u(f) = \sum_{t=1}^{T} a_u(t) e^{-2\pi i f t}    (5)
A simple estimate of the spectrum is the square of the Fourier transform of the data sequence, i.e., |\tilde{a}_u(f)|^2. However, this estimate suffers from bias and leakage. To resolve these issues, we applied the multitaper technique to obtain a smoother Fourier-based spectral density with reduced estimation bias [14]. In the multitaper technique, we apply K tapers successively to the user feature vector and take the Fourier transform:

\tilde{a}_u^k(f) = \sum_{t=1}^{T} w_k(t) a_u(t) e^{-2\pi i f t}    (6)
where w_k(t) (k = 1, 2, ..., K) are K orthogonal taper functions with appropriate properties. A particular choice of these taper functions, with optimal leakage properties, is given by the discrete prolate spheroidal sequences (DPSS) [18]. The multitaper estimate of the spectrum S_{uu}(f) is defined as:

S_{uu}(f) = (1/K) \sum_{k=1}^{K} |\tilde{a}_u^k(f)|^2    (7)
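Equations (6)-(8) can be sketched with SciPy's DPSS implementation; the taper count K and time-bandwidth product NW below are illustrative defaults, not values specified in the paper:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(a, K=5, NW=3.0):
    """Multitaper auto-spectrum estimate S_uu(f) of Equation (7).

    a: length-T activeness vector of one user (one sample per day).
    Returns the spectrum at the T FFT frequencies.
    """
    T = len(a)
    tapers = dpss(T, NW, Kmax=K)                          # K DPSS tapers, shape (K, T)
    tapered = np.fft.fft(tapers * np.asarray(a), axis=1)  # Eq. (6), one row per taper
    return np.mean(np.abs(tapered) ** 2, axis=0)          # Eq. (7): average over tapers

def dominant_power(a, **kw):
    # Dominant power spectrum (Eq. 8): maximum over all frequencies
    return multitaper_spectrum(a, **kw).max()
```

A user whose activeness oscillates with a clear period produces a spectrum concentrated near the corresponding frequency bin, and `dominant_power` summarizes that as a single activeness score.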
Since the spectrum of each user varies with frequency, which implies that a user behaves differently at different periods, in this work we define the dominant power spectrum of user u as its maximum spectrum value across all frequencies:

DPS_u = max_f ( S_{uu}(f) )    (8)
which can be used to represent the overall activeness of user u in social media. Similarly, the cross-spectrum S_{uv}(f) between the behavior processes of users u and v is defined as follows:

S_{uv}(f) = (1/K) \sum_{k=1}^{K} \tilde{a}_u^k(f) \tilde{a}_v^{k*}(f)    (9)
where \tilde{a}_v^{k*}(f) denotes the complex conjugate of \tilde{a}_v^k(f). We then have the spectral density matrix for the multivariate processes as:

S(f) = ( S_{11}(f)  ...  S_{1m}(f)
            ⋮        ⋱      ⋮
         S_{m1}(f)  ...  S_{mm}(f) )    (10)
with the off-diagonal elements representing cross-spectrum and diagonal elements representing auto-spectrum.
3.1.3 Measure Users' Similarity by Spectral Coherence
In this work, we quantify the similarity of two user feature vectors using spectral coherence. The spectral coherency for any pair of user feature vectors a_u and a_v is calculated as:

C_{uv}(f) = S_{uv}(f) / \sqrt{ S_{uu}(f) S_{vv}(f) }    (11)
Spectral coherency is a complex quantity that estimates the strength of coupling between two processes. Its absolute value, called spectral coherence, ranges from 0 to 1. A spectral coherence of 1 indicates that the two signals have a constant phase relationship, and a value of 0 indicates the absence of any phase relationship. Although correlation can also indicate the coupling between two user feature vectors, we choose spectral coherence because it not only tells us how similar two user feature vectors are, but also at which frequencies they are similar. In this study, we obtain an overall spectral coherence score for each pair of users by summing their spectral coherence values over all frequencies:

Coh_{uv} = \sum_f |C_{uv}(f)|    (12)
It is important to note that this spectral coherence score between two users is employed to represent the similarity of these two users across all frequencies. It can be used in many practical applications.
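Putting the spectral estimates together, a pair's overall coherence score could be computed along these lines; this is a sketch using SciPy's DPSS tapers, and the taper parameters K and NW are illustrative choices rather than values from the paper:

```python
import numpy as np
from scipy.signal.windows import dpss

def coherence_score(a_u, a_v, K=5, NW=3.0):
    """Overall spectral coherence score of Equations (11)-(12).

    Returns the sum over all FFT frequencies of |C_uv(f)|, where
    C_uv(f) = S_uv(f) / sqrt(S_uu(f) S_vv(f)).
    """
    T = len(a_u)
    tapers = dpss(T, NW, Kmax=K)                       # K DPSS tapers, shape (K, T)
    Fu = np.fft.fft(tapers * np.asarray(a_u), axis=1)  # tapered transforms of u (Eq. 6)
    Fv = np.fft.fft(tapers * np.asarray(a_v), axis=1)  # tapered transforms of v
    S_uu = np.mean(np.abs(Fu) ** 2, axis=0)            # auto-spectra (Eq. 7)
    S_vv = np.mean(np.abs(Fv) ** 2, axis=0)
    S_uv = np.mean(Fu * np.conj(Fv), axis=0)           # cross-spectrum (Eq. 9)
    C = np.abs(S_uv) / np.sqrt(S_uu * S_vv)            # spectral coherence (Eq. 11)
    return float(C.sum())                              # overall score (Eq. 12)
```

Two identical activeness vectors score the maximum (coherence 1 at every frequency, so the sum equals T), while unrelated vectors score strictly lower.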
3.1.4 Filtering Algorithm
Given a similarity matrix of users, in this work we use two filtering algorithms to construct networks: Mutual Top-K Filtering and simple Thresholding.

Mutual Top-K Filtering. Given a similarity matrix, for each user i we sort the similarity scores between i and all other users in descending order and retrieve the users that have the top similarity scores with i, denoted as Candidate(i). The parameter k, provided as an input, is a percentage indicating the proportion of users retrieved. As a result, i has relatively high similarity with the users in Candidate(i). Secondly, for each user j in Candidate(i), we check whether i also belongs to Candidate(j), which ensures that j has relatively high similarity with i too. If so, we retain j in Candidate(i); otherwise we remove j from Candidate(i). We ensure that each user is associated with at least one other user: if Candidate(i) is an empty set, we associate i with the user p that has the highest Coh_{ip}. Finally, except for the relationships in Candidate(i), for i from 1 to N, we set all other elements in the similarity matrix to zero. Considering the original similarity matrix as a fully connected network, this filtering step removes the edges with relatively lower weights (similarity scores) and constructs a network. Below is the pseudocode of the Mutual Top-K Filtering Algorithm, where N denotes the total number of users:

Mutual Top-K Filtering Algorithm
INPUT: Coh_{ij}, the spectral coherence score of every pair of the N users; ranking threshold k
OUTPUT: clusters of users
/* Filtering */
1:  for each user i = 1 to N
2:    sort the spectral coherence scores Coh_{ij}, j ≠ i, in descending order
3:    assign the users with the top-k spectral coherence scores to Candidate(i)
      // Candidate(i) stores the users having high coherence with i
4:  end for
5:  for each user i = 1 to N
6:    for each user j in Candidate(i)
7:      if i ∈ Candidate(j) then
8:        retain j in Candidate(i)
9:      else remove j from Candidate(i)
10:   end for
11:   if Candidate(i) = ∅ then
12:     assign the user p who has the highest coherence with i to Candidate(i)
13:   end if
14: end for

Thresholding. As in Mutual Top-K Filtering, for each user i we sort the similarity scores between i and all other users in descending order. However, in this method, after calculating the similarities for each pair of users, we only retain the pairs whose similarity is larger than a predetermined threshold and remove the rest.
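The Mutual Top-K Filtering procedure can be rendered in Python roughly as follows; the dictionary data layout and the fallback handling are illustrative choices, not part of the paper:

```python
def mutual_top_k(scores, k):
    """Mutual Top-K Filtering over a symmetric similarity matrix.

    scores: dict {(i, j): similarity} with i != j (each pair stored once,
            keyed by the sorted tuple).
    k: fraction in (0, 1] of top-ranked partners kept per user.
    Returns {i: set of retained partners}.
    """
    users = sorted({u for pair in scores for u in pair})

    def ranked(i):
        # all other users, sorted by similarity with i, descending
        return sorted((j for j in users if j != i),
                      key=lambda j: scores[tuple(sorted((i, j)))], reverse=True)

    top = max(1, int(k * (len(users) - 1)))
    cand = {i: set(ranked(i)[:top]) for i in users}       # Candidate(i)
    result = {}
    for i in users:
        mutual = {j for j in cand[i] if i in cand[j]}     # keep only mutual picks
        if not mutual:                                    # never leave a user isolated:
            mutual = {ranked(i)[0]}                       # fall back to best partner
        result[i] = mutual
    return result
```

Edges surviving the mutual check (plus the fallback edges) form the filtered network; everything else in the similarity matrix is treated as zero.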
4. EXPERIMENT
4.1 Dataset
In this experiment, we use the social media site Digg.com as the test bed. Digg.com provides a platform for Web users to submit stories for discussion. Stories (or news) are submitted by Digg.com users; users who find a story interesting can comment on it or digg it. We utilized the Digg API to collect the dataset from Digg.com. We arbitrarily selected five popular newswires and collected all stories from these newswires submitted by Digg users for discussion. The five news sources were CNN, BBC, NPR, The Washington Post, and Yahoo! News. The dataset spanned the three months from March 1st, 2011 to May 31st, 2011. During these three months, we collected all the “top news” (as defined and recommended by Digg.com), the comments on all these stories, all user information, and the timestamps. Specifically, for each story we recorded its story ID, submitter ID, a brief abstract of the story, all comments and the commenters' IDs, and the corresponding timestamps for both the story and the comments. In total, there were 12,742 stories, 13,531 comments (only 590 stories have comments) and 286 active users (we defined users who were active on more than 10 days during this 3-month period as active users and filtered out the inactive ones).
4.2 Gold Standard
To evaluate the performance of the proposed techniques, we recruited two human annotators to analyze the dataset and generate a Gold Standard. The generation of the Gold Standard consisted of the following steps: 1) 50 pairs of Digg users in the dataset were selected randomly; 2) the human annotators independently examined all the stories and comments submitted by each pair of users, identified the subjects of each story or comment, and determined the areas in which each user was most interested; 3) the annotators independently identified whether a relationship of common interest existed between each pair of users. To determine the reliability of the Gold Standard produced by the two human annotators, we used the weighted Kappa measure to compute the inter-rater agreement. The weighted Kappa measure is a statistical measure, extended from the Kappa measure, for computing the agreement between two ordered lists. It has a maximum value of 1 when there is perfect agreement between two raters and a value of 0 when the agreement is no better than chance. In general, a weighted Kappa value larger than 0.8 is considered very good agreement between two raters [3]. In our experiment, the weighted Kappa measure was 0.84, which means the two annotators had very good agreement and the Gold Standard was reliable for the experiment.
4.3 Baseline Method
To test the effectiveness of our proposed techniques, we compared them with two baseline methods, 1×N and N×N, which only consider explicit interactions. For each story in the dataset, we had the user ID of the submitter and of the Digg users who made comments on the story. The 1×N method constructs a 1-to-N network where each link corresponds to an interaction between the submitter and a commenter of a story; that is, we only consider an explicit relationship between two users when the submitter of a story receives a comment from another Digg user. The N×N method constructs a fully connected network where any two users within the same story, including the story submitter and all commenters, are connected by links; that is, in addition to the “submit and comment” explicit interaction, we consider an explicit relationship to occur between two users when they comment on the same story that both of them are interested in.
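A sketch of the two baseline constructions, assuming stories are given as (submitter, commenter-list) pairs (an illustrative layout, not the paper's actual data schema):

```python
from itertools import combinations

def explicit_edges(stories, mode="1xN"):
    """Baseline explicit-relationship edges from Digg-style stories.

    stories: list of (submitter, [commenters]) pairs.
    "1xN": link the submitter to each commenter.
    "NxN": clique over all participants of the same story.
    Returns a set of undirected edges as sorted tuples.
    """
    edges = set()
    for submitter, commenters in stories:
        if mode == "1xN":
            edges |= {tuple(sorted((submitter, c)))
                      for c in commenters if c != submitter}
        else:  # "NxN"
            participants = sorted({submitter, *commenters})
            edges |= set(combinations(participants, 2))
    return edges
```

On the same data, the N×N edge set always contains the 1×N edge set, which is why N×N recalls more relationships at some cost in precision.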
4.4 Measurement
In our experiment, precision, recall, and F1 measures are used as the metrics to evaluate the performance of the proposed techniques. These metrics are computed from four counts, true positive (TP), false positive (FP), true negative (TN) and false negative (FN), as illustrated in Table 1.

Table 1. TP, FP, TN and FN

                          Temporal Analysis Algorithms
                              Yes          No
Human Annotators   Yes        TP           FN
                   No         FP           TN
“Yes” denotes that a relationship is determined by an algorithm or the human annotators; “No” denotes that no relationship between the two users is determined by an algorithm or the human annotators. If a relationship is determined by both an algorithm and the human annotators, it is a TP. The formulations of precision, recall and F1-measure are presented below.

Precision: the number of TP divided by the number of TP plus FP.
Precision = TP / (TP + FP)

Recall: the number of TP divided by the number of TP plus FN.
Recall = TP / (TP + FN)

F1-Measure: the harmonic mean of precision and recall, defined as:
F1-Measure = 2 × Precision × Recall / (Precision + Recall)
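These three metrics can be computed directly from the predicted and gold-standard relationship sets:

```python
def prf1(predicted, gold):
    """Precision, recall and F1 of predicted relationship pairs
    against the gold-standard pairs (both given as sets)."""
    tp = len(predicted & gold)   # pairs both the algorithm and annotators accept
    fp = len(predicted - gold)   # pairs only the algorithm accepts
    fn = len(gold - predicted)   # pairs only the annotators accept
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```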
4.5 Results
In this experiment, we used the Digg users' relationships generated by the human annotators as the Gold Standard. We applied the baseline approaches and our techniques separately on the same dataset to generate the relationships among Digg users. The 1×N and N×N baseline methods were employed to identify Explicit Relationships (ER), whereas the proposed Thresholding and Mutual Top-K Filtering approaches were used to identify Implicit Relationships (IR). The results produced by each approach were compared with the Gold Standard. Table 2 presents the recall and precision of each approach, and Figure 1 presents the F1 measures.
Table 2. Recall and Precision

Threshold   IR (Thresholding)       IR (Mutual Top-K Filtering)
            Recall     Precision    Recall     Precision
0.05        0.17391    0.6667       0.13043    —
0.10        0.21739    0.55556      0.21739    0.50
0.15        0.21739    0.45455      0.26087    0.54545
0.20        0.26087    0.46154      0.34783    0.53333
0.25        0.30435    0.50         0.3913     0.50
0.30        0.3913     0.52941      0.3913     0.45
0.35        0.43478    0.50         0.52174    0.52174
0.40        0.52174    0.54545      0.52174    0.50
0.45        0.52174    0.52174      0.52174    0.50
0.50        0.52174    0.52174      0.52174    0.50
0.55        0.52174    0.52174      0.56522    0.48148
0.60        0.56522    0.50         0.6087     0.48276

ER (1×N):   Recall 0.04348, Precision 1.00
ER (N×N):   Recall 0.21739, Precision 0.83333
[Figure 1 plots the F1 measure of Thresholding, Mutual Top-K Filtering, Baseline 1×N and Baseline N×N against thresholds from 0.05 to 0.6.]
Figure 1. F1 Measure

As shown in Table 2, both the Thresholding and Mutual Top-K Filtering approaches outperform the two baseline methods as the threshold increases, although the difference between Thresholding and Mutual Top-K Filtering is not substantial. The baseline methods, whether 1×N or N×N, achieve very high precision (i.e., Precision = 1.00 for 1×N and Precision = 0.83 for N×N). This is because the number of explicit relationships that can be identified by 1×N and N×N is small, but the extracted explicit relationships are true relationships. However, the recall is extremely poor (i.e., Recall = 0.04 for 1×N and Recall = 0.22 for N×N) because many true relationships cannot be extracted from explicit interactions alone. On the other hand, Thresholding and Mutual Top-K Filtering suffer lower precision but achieve substantially higher recall. That means implicit relationship identification can extract substantially more true relationships; however, it also extracts more false relationships. Figure 1 shows that both Thresholding and Mutual Top-K Filtering obtain better F1 measures than the 1×N baseline method. Thresholding is superior to the N×N baseline method in F1 measure when the threshold is larger than 0.2, and Mutual Top-K Filtering when the threshold is larger than 0.15; the difference increases as the threshold increases. By identifying the implicit relationships among Digg users, our proposed techniques identify user relationships better than relying on explicit user relationships alone. The two baseline approaches based on explicit interactions achieved relatively higher precision but lower recall, which means that an explicit interaction between web users is an effective indicator of common interest. Unfortunately, these explicit interactions capture only a small percentage of the potential relationships between web users.
Indeed, using the explicit interactions can recall less than 22% of the user relationships of common interest. In this work, we propose the temporal analysis technique to identify the implicit relationships. As shown in Figure 1, in terms of F1 measure, our proposed techniques consistently outperformed the 1×N baseline. In addition, as the threshold increases, the two proposed approaches also substantially outperformed the N×N baseline method. However, there is a tradeoff: the proposed technique recalls substantially more relationships but achieves lower precision. That is, it extracts more false positives while extracting substantially more true positives.
The true positives are particularly important when we use the extracted relationships to recommend social media users to one another to promote interactions. Social media sites have a large volume of user-contributed content as well as a large number of users. It is impossible for any social media user to follow all the relevant information in such a huge content and user space. Each social media user is only aware of a relatively small number of users who share a common interest, which explains why there are so many isolated communities in social media. Unless there is a good recommendation system to inform social media users about other users of common interests, the interactions in social media will remain limited. Identifying the true positives is important for connecting users and communities, even at the sacrifice of offering false positives. It takes user effort to filter out the false positives, but users can identify substantially more potential relationships than they would otherwise be aware of. On the other hand, if a recommendation system has high precision (few false positives) but extracts very few true positives, the utility of its recommendations remains limited.
5. CONCLUSION
Due to the popularity of social media, social commerce has drawn substantial attention in both academia and industry. Social commerce offers new opportunities for electronic commerce companies to collect user data for understanding user behavior as well as the market. A social network is composed of actors (social media users) and ties (user relationships). Social network analysis and mining techniques support social commerce by identifying potential consumers based on user relationships. The user relationships in social media include explicit relationships and implicit relationships. The explicit relationships can easily be extracted from the direct interactions between users; however, they fail to capture many user relationships. In this paper, we employ temporal analysis to extract the implicit relationships between social media users. We utilize spectral analysis and spectral coherence to measure the similarity between two user feature vectors. A simple Thresholding method and a Mutual Top-K Filtering method are developed to extract the implicit relationships. An experiment was conducted using the Digg.com dataset, with two human annotators recruited to generate the gold standard. The experiment results show that the proposed techniques achieve substantially higher recall but suffer in precision. The results on the harmonic mean of precision and recall, the F1 measure, show that the proposed techniques (implicit relationships) outperform the baseline methods (explicit relationships). In the future, we plan to further enhance the filtering techniques as well as extend the current techniques to user clustering and community detection applications.