2015 IEEE International Conference on Services Computing
Personalized QoS Prediction via Matrix Factorization Integrated with Neighborhood Information
Kesheng Qi1, Hao Hu1, Wei Song2, Jidong Ge3, Jian Lü1
1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
3 Software Institute, Nanjing University, Nanjing, China
[email protected], [email protected]
reality, the QoS of the same service may vary widely across users, because QoS is susceptible to users' heterogeneous environments such as network conditions and locations. A user's resources are limited, and it is impractical for a user to invoke all the candidate services to acquire their QoS values, because: 1) evaluating all the candidate services is time-consuming; 2) invoking Web services for evaluation in order to test or collect QoS data consumes the service providers' resources; and 3) many services on the Internet charge for invocation. Accordingly, personalized QoS prediction is regarded as a promising way for users to learn the QoS before actual invocation [6], [7], [8], [9]. Figure 1 illustrates the problem with a simple example. Each entry in the user-service invocation matrix represents the observed value of a QoS property (e.g., response-time in the example) of a Web service invoked by a user. QoS prediction helps the current user estimate the QoS values of a target Web service before invoking it, by taking advantage of other users' Web service invocation experience. In other words, the task is to obtain the missing values in the user-service invocation matrix.
Collaborative Filtering (CF) has been used for QoS prediction in the recent literature [6], [10], [11]. CF algorithms fall into two categories: memory-based and model-based [12]. Two kinds of memory-based CF, user-based and item-based, are mainly employed; both utilize the QoS information of the users (items) similar to the current user (item) [13]. The Pearson correlation coefficient (PCC) [14] is usually employed as the similarity measure to find a set of similar users or items. However, we argue that PCC relies too heavily on historical QoS records, while the available QoS values in the user-service matrix may be inaccurate because of the complex network environment.
Furthermore, the user-service matrix is in fact very sparse, which can also make the similarity calculation inaccurate. Besides, each QoS value in the matrix is determined by the physical environment; it is objective rather than a user's subjective judgment of a service. Thus we should take the physical information of Web services into consideration to further improve the accuracy of QoS prediction. Matrix Factorization (MF), a model-based collaborative
Abstract—Currently, the number of Web services on the Internet is growing exponentially. Faced with a large number of functionally equivalent candidate services, users always hope to select the optimal one, which provides the best QoS values. However, users usually do not know the QoS values of all the candidate services, because historical service invocation records are limited. Various QoS prediction methods have been proposed to predict the QoS values of candidate services. Nevertheless, most of them do not take the physical features of Web services fully into consideration, and thus their prediction accuracy is still unsatisfactory. To this end, we propose a novel Matrix Factorization method that integrates both user network neighborhood information and service neighborhood information into the Matrix Factorization model to predict personalized QoS values. To validate our method, experiments are conducted on a real-world Web service QoS dataset that includes 1,974,675 Web service invocation records. Experimental results show that our method achieves better prediction accuracy than the state-of-the-art methods.
Keywords-Web service; QoS prediction; matrix factorization; neighborhood information
I. INTRODUCTION
Web services are self-described software applications designed to support interoperable machine-to-machine interaction over a network via standard interfaces and communication protocols [1]. Because Web services are self-describing, reusable and loosely coupled, many companies prefer to build their systems on existing Web services, which can greatly shorten the development period. However, with the prevalence of Web services, the number of services on the Internet increases rapidly, and service users are usually faced with many functionally equivalent services. It is desirable to select the best-performing one among them. Quality of Service (QoS), a group of attributes (e.g., response-time, throughput, reliability, etc.), is widely used to characterize the nonfunctional performance of Web services and plays a critical role in service selection [1], [2], [3]. Service users can select the service with the optimal nonfunctional performance from the functionally equivalent candidates according to their QoS values. Extensive studies on QoS-based service selection have been presented [2], [4], [5]. Their common premise is that the users already know the QoS values of all the candidate services. But in
978-1-4673-7281-7/15 $31.00 © 2015 IEEE DOI 10.1109/SCC.2015.34
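[Figure 1 content: a user-service invocation matrix over users U1 to U4 and services S1 to S4; the observed response times range from 1 ms to 6 ms, and some entries are missing.]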
candidate services are already known and accurate, which is rarely the case in reality. QoS prediction is therefore significant for service computing. Currently, collaborative filtering is the mainstream approach to QoS value prediction. User-based collaborative filtering was first applied to this problem by Shao et al. [6]. Afterwards, Zheng et al. [7] proposed a hybrid approach that fuses the prediction results of user-based CF and item-based CF, using confidence weights to balance the effects of the two methods. However, the methods proposed in [6], [7] use only PCC for similarity computation, without considering the physical features of Web service invocations, which can lead to inaccurate predictions. Chen et al. [19] first considered the influence of users' locations and clustered users by their locations and historical records before the collaborative filtering step. Tang et al. [20] proposed a location-aware collaborative filtering method that incorporates both users' locations and services' locations to find similar users and similar services. To obtain the neighborhood information, the method first employs PCC to calculate the similarity between the current user (service) and those users (services) located in the same Autonomous System (AS)1 and country. Nevertheless, when the number of users and services becomes large, the user-service invocation matrix usually becomes large and sparse, and these methods, classified as memory-based CF, do not perform well. Matrix Factorization, a model-based CF method, can still do well when the user-item matrix is very large and sparse, and it has been widely adopted in traditional recommender systems [15]. Lo et al. [8] proposed two novel regularization terms, which denote the differences between the feature vectors of the current user or service and those of their neighbors, respectively, and appended them to the basic MF model.
However, the neighborhood information was obtained by PCC, which ignores environmental factors. They also proposed a location-aware method [21] that employs another regularization term, based on the intuition that users in nearby areas tend to share similar Web service invocation experience. However, the regularization term in the MF model serves to avoid overfitting in the learning process, so the methods in [8], [21] have difficulty giving a convincing interpretation of the neighbors' contributions to the QoS values. Zheng et al. [16] proposed a prediction approach via neighborhood-integrated matrix factorization, which treats the predicted QoS value as the ensemble of a user's information and the information of the user's neighbors. Nonetheless, the method obtains the users' neighbors by PCC, which can also result in inaccuracy. Xu et al. [22] adopted a similar approach, defining the neighbors as those users whose distance from the current user is less than a threshold. The distance is the Euclidean distance, calculated from the users' longitudes and latitudes. However, two users who are near in physical distance may be very far apart in network distance. In this paper, we make use of the users'
Figure 1. User-Service Invocation Matrix
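As a minimal illustration of the prediction task, the sketch below stores a user-service invocation matrix as a NumPy array in which `NaN` marks a service the user has not invoked. The response-time values and their positions are hypothetical (they are not the entries of Figure 1), and the names `Q`, `observed` and `missing` are chosen here for illustration only.

```python
import numpy as np

# Hypothetical 4 x 4 user-service matrix of response times in ms.
# Rows are users U1..U4, columns are services S1..S4; NaN = not yet invoked.
Q = np.array([
    [1.0, np.nan, 2.0, 5.0],
    [np.nan, 2.0, np.nan, 1.0],
    [np.nan, 2.0, 6.0, np.nan],
    [2.0, np.nan, np.nan, 3.0],
])

observed = ~np.isnan(Q)            # True where a QoS value was observed
density = observed.mean()          # fraction of known entries
missing = np.argwhere(~observed)   # (user, service) index pairs to be predicted

print(f"density = {density:.2f}, entries to predict = {len(missing)}")
```

QoS prediction then amounts to estimating a value for every index pair in `missing` from the entries flagged in `observed`.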
filtering method, has recently been proposed to predict ratings in recommender systems [15]. MF is good at estimating the overall structure that relates simultaneously to all users or items, but poor at detecting strong associations among a small set of closely related users or items, which is precisely where memory-based CF does better [16]. Besides, as noted, some QoS values in the user-service matrix may be inaccurate. Nevertheless, the neighbors of the current user (service) should have experience similar to that of the current user (service). We therefore collect the wisdom of crowds, making use of the strong associations mentioned above, to improve the prediction accuracy.
In this paper, we propose an approach to predicting QoS values by integrating neighborhood information of both users and services into the matrix factorization model. Moreover, when selecting each user's neighborhood, we combine the user's network location information with PCC to generate a robust neighborhood, based on the idea that users in close network locations share similar service invocation experience because they share the same IT infrastructure (e.g., network workloads, routers, etc.). Finally, we conduct extensive experiments on a large-scale real-world Web service QoS dataset, and the experimental results show the effectiveness of our approach. In summary, the contributions of this paper are as follows:
• We integrate the neighborhood information of both users and services into the matrix factorization model for personalized QoS prediction.
• We incorporate users' network locations and PCC to generate robust user neighborhoods.
• Extensive experiments are conducted on a real-world dataset, and the experimental results demonstrate that our method performs better than other state-of-the-art methods.
The remainder of this paper is organized as follows: Section II reviews related work on personalized QoS prediction.
Section III presents an overview of our personalized QoS prediction approach. Section IV details the selection process of the users' and services' neighborhood information. Section V introduces how to revamp the matrix factorization model with the neighborhood information. Section VI discusses the experimental results. Finally, Section VII concludes the paper.
II. RELATED WORK
As service computing becomes popular, QoS plays an important role in service selection [5], service discovery [17], service composition [18], and so on. Most previous research assumes that the QoS values of the
1 An autonomous system (AS) is a collection of IP networks and routers under the control of one entity (such as a university or a business enterprise). An autonomous system has a globally unique number (ASN).
[Figure 2 labels: User QoS Records; Information Collection Handler; Network Location Information Handler; Service Users; Prediction Results; User-Service Invocation Matrix; Find Similar Users]
services based on the training results and recommend the optimal Web service to the user.
[Figure 2 labels: Find Similar Services; Neighborhood Integrated Model Training; Output Handler]
In this section, we describe in detail our method of finding the neighborhoods of users and services. In previous work, the Pearson Correlation Coefficient (PCC) and Vector Space Similarity (VSS) are the two approaches commonly used for similarity computation. As PCC achieves better performance than VSS [12], we apply PCC in our method. The PCC value ranges from -1 to 1, and a larger value means higher similarity.
[Figure 2 labels: Training Results; Active User; UDDI Registry]
A. User Neighborhood Selection
The PCC formula for computing the similarity between two users i and k is as follows:
Figure 2. Overview of Our Prediction Approach
network locations to obtain users' network neighbors, and combine the information of both users' network neighbors and services' neighbors with the matrix factorization model for QoS prediction.
IV. NEIGHBORHOOD SELECTION
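$$\mathrm{Sim}(i, k) = \frac{\sum_{j \in J} (q_{ij} - \bar{q}_i)(q_{kj} - \bar{q}_k)}{\sqrt{\sum_{j \in J} (q_{ij} - \bar{q}_i)^2}\,\sqrt{\sum_{j \in J} (q_{kj} - \bar{q}_k)^2}}, \tag{1}$$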
where $J$ denotes the subset of services invoked by both user $i$ and user $k$, $q_{ij}$ represents the QoS value of Web service $j$ observed by service user $i$, and $\bar{q}_i$ and $\bar{q}_k$ denote the average QoS values of the different Web services invoked by service users $i$ and $k$, respectively. A positive PCC value means the two users are similar, and a negative value means they are dissimilar. As mentioned in [24], two dissimilar service users can appear similar when they happen to have only a few co-invoked Web services. We therefore use a similarity weight to reduce the influence of the overestimated similarity, and adjust the PCC formula to compute the similarity of user $i$ and user $k$ as follows:
III. OVERVIEW OF OUR APPROACH
In this section, we give an overview of our QoS prediction approach. As shown in Figure 2, our approach mainly includes the following procedures:
1) We collect information from the service users, including their IP addresses and Web service invocation records. Service users' IP addresses are easy to obtain; for their Web service invocation QoS records, we can provide a Web form for users to input their QoS values directly, or employ client-side middleware to collect the QoS values automatically [23]. The users' Web service invocation QoS records are then used to form the user-service invocation matrix.
2) We identify users' network locations according to their IP addresses. Following the method of describing a user's location in [20], we employ a triple (IPu, ASNu, CountryIDu) to denote a user's network location, where IPu denotes the user's IP address, ASNu denotes the number of the Autonomous System to which IPu belongs, and CountryIDu denotes the ID of the country to which IPu belongs. We use the Autonomous System (AS) to represent users' network locations because users in the same AS share the same network conditions, as all its routers run the same routing protocol. Besides, ASNu and CountryIDu can easily be obtained from IPu via many services on the Internet, such as the iplocation service2.
3) We use both the users' network locations and the user-service invocation matrix to obtain the user neighborhood information and the service neighborhood information. We then integrate the neighborhood information of both users and services into the MF model for model training.
4) When the active user submits a query request, the output handler will return the QoS values of the candidate
$$\mathrm{Sim}'(i, k) = \frac{2 \times |J_i \cap J_k|}{|J_i \cup J_k|}\,\mathrm{Sim}(i, k), \tag{2}$$
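The PCC of Eq. (1) and its weighted form in Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `pcc` and `adjusted_sim` are hypothetical, unobserved entries are assumed to be stored as `NaN`, and returning 0 when two users share fewer than two services (or when a denominator is zero) is a convention chosen here.

```python
import numpy as np

def pcc(Q, i, k):
    """Eq. (1): Pearson correlation of users i and k over co-invoked services."""
    obs = ~np.isnan(Q)
    J = obs[i] & obs[k]                  # services invoked by both users
    if J.sum() < 2:
        return 0.0                       # assumption: too little overlap to compare
    mean_i = np.nanmean(Q[i])            # average over all services user i invoked
    mean_k = np.nanmean(Q[k])
    di = Q[i, J] - mean_i
    dk = Q[k, J] - mean_k
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dk ** 2).sum())
    return float((di * dk).sum() / denom) if denom > 0 else 0.0

def adjusted_sim(Q, i, k):
    """Eq. (2): PCC scaled by 2 * |intersection| / |union| of invoked services."""
    obs = ~np.isnan(Q)
    inter = (obs[i] & obs[k]).sum()      # services invoked by both users
    union = (obs[i] | obs[k]).sum()      # services invoked by either user
    if union == 0:
        return 0.0
    return 2.0 * inter / union * pcc(Q, i, k)
```

With the weight taken literally from Eq. (2), two users who invoked exactly the same services receive a weight of 2, so the adjusted value is not confined to the interval from -1 to 1; only its sign and its ranking matter for the neighbor selection.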
Here $J_i$ denotes the set of Web services invoked by user $i$, so $|J_i \cap J_k|$ is the number of Web services invoked by both user $i$ and user $k$, and $|J_i \cup J_k|$ is the number of Web services invoked by either user $i$ or user $k$. When $|J_i \cap J_k|$ is small, the similarity weight $2\,|J_i \cap J_k| / |J_i \cup J_k|$ is small and reduces the effect of the overestimated similarity to some extent. After computing the similarity between the current user and the other users with the adjusted PCC values $\mathrm{Sim}'(i, k)$ of Eq. (2), we can define the set of Top-K similar users as the current user's neighbors. However, the number of similar users of each user is limited. Traditional Top-K algorithms ignore this and may include dissimilar users with negative PCC values as neighbors, which introduces noise. We therefore revise the traditional Top-K algorithm as follows:
$$N(i) = \{\, k \mid k \in \text{Top-K}(i),\ \mathrm{Sim}'(i, k) > 0,\ k \neq i \,\}, \tag{3}$$
where $N(i)$ is the set of neighbors of the current user $i$ and Top-K($i$) is the set of the Top-K most similar users. Since the performance of Web services is determined by the physical environment, users in the same AS will share similar service invocation experience because of the similar network conditions. We can
2 http://www.iplocation.net/
is the key task. Let an $m \times n$ matrix $Q = \{q_{ij}\}$ denote the user-service invocation matrix. The matrix factorization approach uses a rank-$d$ matrix $P = U^{T} S$ to fit it, where $U \in \mathbb{R}^{d \times m}$ and $S \in \mathbb{R}^{d \times n}$ represent the latent feature matrices of users and services, respectively. The inner product of the two latent feature vectors $u_i$ and $s_j$ is then used to approximate the QoS value $q_{ij}$ in the user-service invocation matrix as
employ both the user's network location and QoS history records to generate robust user neighbors. The approach includes the following steps:
Step 1: Compute the similarity between the current user u and the other users located in the same AS as u. If the number of u's similar users is equal to or greater than the threshold k, go to Step 4.
Step 2: Compute the similarity between the current user u and the other users located in the same country as u. If the number of u's similar users is equal to or greater than the threshold k, go to Step 4.
Step 3: Compute the similarity between the current user u and all the other users regardless of location.
Step 4: Let u's Top-K most similar users be the neighbors of u, represented by N(u).
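Steps 1 to 4 can be sketched as a search that widens the candidate pool from the AS to the country to all users. The sketch below is one possible reading, not the authors' code: `select_neighbors` is a hypothetical name, `sim` is assumed to be a precomputed matrix of adjusted similarities, `asn` and `country` are assumed to be NumPy arrays holding each user's AS number and country ID, and "similar users" is interpreted as users with positive similarity, in line with Eq. (3).

```python
import numpy as np

def select_neighbors(u, sim, asn, country, k):
    """Widen the candidate pool (AS, then country, then all users) until at
    least k users with positive similarity are found, then keep the top k."""
    m = sim.shape[0]
    others = np.arange(m) != u                   # exclude the current user
    pools = [
        others & (asn == asn[u]),                # Step 1: same Autonomous System
        others & (country == country[u]),        # Step 2: same country
        others,                                  # Step 3: all other users
    ]
    for pool in pools:
        cand = np.flatnonzero(pool & (sim[u] > 0))   # Eq. (3): positive similarity only
        if len(cand) >= k:
            break                                # enough similar users: go to Step 4
    # Step 4: the Top-K most similar users of the final pool
    order = cand[np.argsort(-sim[u, cand], kind="stable")]
    return order[:k].tolist()
```

If fewer than `k` users have positive similarity even after Step 3, the function returns all of them, so a neighborhood may contain fewer than `k` users.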
" ≈ $
To minimize the difference between the matrix P and the original matrix Q, we minimize the following term:
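$$\mathrm{Sim}(j, l) = \frac{\sum_{i \in I} (q_{ij} - \bar{q}_j)(q_{il} - \bar{q}_l)}{\sqrt{\sum_{i \in I} (q_{ij} - \bar{q}_j)^2}\,\sqrt{\sum_{i \in I} (q_{il} - \bar{q}_l)^2}}, \tag{4}$$
$$\mathrm{Sim}'(j, l) = \frac{2 \times |I_j \cap I_l|}{|I_j \cup I_l|}\,\mathrm{Sim}(j, l),$$
Solving Eq. (8) directly causes the overfitting problem [25]. So in the learning process, regularization terms are usually added in practice to avoid overfitting: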
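[Equation (9): the regularized objective $\min_{U,S} \mathcal{L}(U, S)$; its right-hand side is not recoverable from the extracted text.]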
where $\|\cdot\|_F^2$ is the Frobenius norm, and