Top K Query for QoS-Aware Automatic Service ... - IEEE Xplore

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON SERVICES COMPUTING 1

Top K Query for QoS-Aware Automatic Service Composition

Wei Jiang, Songlin Hu, and Zhiyong Liu

Abstract—With the proliferation of Web services, service engineers demand automatic service composition algorithms that not only synthesize correct service compositions from thousands of services but also satisfy the quality requirements of users. This is known as the QoS-aware automatic service composition problem. Our observation is that current research, which finds only the optimal service composition result, has several shortcomings. Users have to use the single optimal result, which makes the approach rigid and brings about problems such as overload of “hot services” and lack of choices for users. To cope with these problems, this paper introduces a top k query mechanism and proposes a progressive and incremental Key-Path-Based Loose (KPL) algorithm with 100% accuracy. Our QSynth, which won the performance championship of Web Service Challenge 2009 and 2010, is extended to support top k query based on the KPL algorithm. Evaluations show that, compared to the state of the art, the KPL algorithm achieves superior scalability and accuracy across a large variety of composition scenarios. Moreover, based on the above work, we generalize a new graph problem: the top k DAGs (Directed Acyclic Graphs) problem. Applications of this new graph problem include API recommenders, supply chains and so on. The KPL algorithm illustrated in this paper can address them efficiently too.

Index Terms—Automatic Service Composition, QoS-Aware, Top K Query

1 INTRODUCTION

Service composition aims at reusing and composing existing atomic Web services to build rich functionalities, and has become very popular in developing service-oriented applications. The service composition result is also called a composite service, which defines the invoking structure and the control flow of the participating atomic services. Our nearly one-year survey of public Web services, based on Amazon EC2, shows that the number of Web services on the Internet has become large [1], so it is far from practical to ask users to manually select and compose interoperable services. Thus, automatic service composition, which automatically generates correct service compositions that satisfy the functional request, has come into being and attracts much attention from both research and industrial communities [2], [3], [4], [5]. Real scenarios from Amazon [6] and SAP [7] show that it is helpful for modelling and verifying service compositions.

In addition to functional correctness, such service compositions also need to have good global quality of service (QoS) in terms of speed, cost, reliability and so on. The global QoS is computed from the QoS of the individual services. The QoS of an individual service can usually be retrieved from its Service Level Agreement (SLA), which is the contract between service provider and user about QoS. For example, the Amazon S3 SLA guarantees a response within 300 ms for requests in most cases [8]. The combination of automation and guaranteed global QoS leads to the so-called QoS-aware automatic service composition (ASC) problem. Note that this is different from the QoS-aware service selection problem [9], [10], which needs a predefined abstract process or composition template as input.

Most existing works [11], [12], [13], [14] treat the QoS-aware ASC problem as a graph search or an AI planning problem. In the graph view, a service composition query is transformed into a search for sub-graphs of the service dependency graph, which is generated according to the interoperability relationships among services. However, these works [11], [12], [13], [14] only focus on finding the optimal service composition, that is, the one with the best global QoS. Just as a presidential election with only one candidate goes against the wishes of the people, the lack of alternative composition results is inconvenient for users. Furthermore, the returned optimal service composition may not be the user's favorite one, because users have various preferences besides quantitative QoS, e.g., brand preference. Thus, we propose to retrieve the top k service composition results¹ to mitigate these limitations and bring the following benefits.

(1) Replaceability. Whenever some atomic services in the currently chosen service composition result become invalid, we may replace this composition result with another one among the top k results to fulfill the query without launching the query again.

(2) Differentiated service. Users may have preferences that cannot be measured by a quantitative QoS value. For example, users in mainland China may prefer Baidu's services to other search services, and tend to select a service composition with a Baidu service even though its QoS might be slightly worse than that of the others. Unlike the optimal service composition, which leaves no choice, a top k query provides several good candidate solutions for further personalized recommendation.

(3) Load balance. From the service providers' perspective, it is also better to provide top k service compositions than a single optimal composition, so as to avoid a “hot” composition or overload of “hot atomic services”.

1. Top k service composition results refer to those service compositions whose overall QoS values are among the top k ones in all service compositions that can satisfy the query. The optimal service composition has the same meaning as the top 1 service composition.

• Wei Jiang, Songlin Hu and Zhiyong Liu are with Institute of Computing Technology, Chinese Academy of Sciences, China. Wei Jiang and Songlin Hu are also with State Key Laboratory of Software Engineering, Wuhan University, China. Liu is also with the State Key Laboratory of Computer Architecture in the Institute of Computing Technology, Chinese Academy of Sciences. Now, Wei Jiang is with Greatwall Drilling Company R&D Academy of Well Logging, CNPC. Songlin Hu is the corresponding author. E-mail: jiangwei, husonglin, [email protected]

Digital Object Identifier 10.1109/TSC.2013.41

1939-1374/13/$31.00 © 2013 IEEE


Meanwhile, the accuracy² of a top k query should be as high as possible, which in turn provides more high-quality alternatives for users. Otherwise, false top k composition results with bad global QoS may violate the QoS requests of users. In conclusion, from the perspectives of both users and providers, it is more appropriate to provide top k results with high accuracy than a single top 1 result. However, addressing the top k query is not trivial, for the following reasons.

• It is very costly to obtain the top k service composition results by enumerating all results and then ranking them with current solutions. This is confirmed in [15], especially when the number of services is very large. Take a simple composition template with only 4 tasks as an example: if each task has 10 candidates, there are as many as $10^4$ candidate results.

• The existing approaches for the optimal service composition result cannot be applied to the top k query directly. Intuitively, we could retrieve the optimal composition result by an existing approach, remove the services contained in it from the service set, and find the next optimal result in the remaining service set recursively. However, this is infeasible, because some services in the top 1 result may also appear in other top k results. If these services have already been removed, the results may be wrong.

• The solutions for top k problems from other related research fields (e.g., the top k shortest path problem) cannot address our problem, for several reasons. The details are discussed in Section 7 and Section 8.

In order to cope with the top k problem in QoS-aware automatic service composition, we design a novel key-path-based loose (KPL) algorithm. We also extend our tool QSynth [13], which won the performance championship of Web Service Challenge 2009 and 2010, to support top k query based on the algorithm.

The main contributions of this paper are as follows:

• We design the KPL algorithm to address the top k query of the QoS-aware automatic service composition problem.

• We extend the KPL algorithm to support multiple QoS criteria by incorporating approaches from research on ranking over multidimensional datasets.

• We generalize a new graph problem, the top k DAGs problem, which can be used to model many other applications, such as API recommenders and supply chains.

The remainder of the paper is organized as follows. Section 2 gives a motivating example. Section 3 presents the definition of the top k query of the QoS-aware automatic service composition problem and its background. Section 4 explains our key-path-based loose algorithm. Section 5 shows the evaluation. Section 6 describes how to extend our algorithm to multiple QoS criteria. Section 7 proposes a new graph problem and its applications. Section 8 introduces the related work. Finally, Section 9 presents our conclusions.

2 A MOTIVATING EXAMPLE

In order to illustrate the (top k) QoS-aware automatic service composition problem, we present an example. Readers familiar with the QoS-aware ASC problem may skip this section.

Assume that you have a document and want an online service to convert it to a PDF file and fax it to someone else as quickly as possible. As illustrated at the top of Fig.1, this process contains four tasks: “virus check”, “spell check”, “pdf conversion” and “fax” (each task can be fulfilled by one atomic service or one composite service). In particular, spell check needs “dictionary service”, “thesaurus service” and “grammar service” together. The details of the 9 candidate services are shown in Table 1, which lists their inputs, outputs, functions and QoS. We build a dependency graph among these services with two extra nodes, Start and End, for the request. This dependency graph is shown at the bottom of Fig.1. Each vertex (node) in the graph represents a service, and the weight (QoS) of the vertex is the service's response time. We connect two services, WA and WB, by an edge from WA to WB if one of WA's outputs matches one of WB's inputs. We use the letters D, F, P, V, S1, S2, S3 to represent the services' inputs and outputs.

2. The accuracy is used to judge whether our retrieved top k results are really the k best composition results. The accuracy of the existing heuristic approach is not guaranteed, as shown in our evaluation.

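[Figure 1. Top: the document process with the tasks Virus Check, Spell Check, PDF Conversion and Fax (two providers). Bottom: the service dependency graph from Start to End over VirC1 (8 ms), VirC2 (20 ms), Dic1 (20 ms), Thes1 (15 ms), Gram1 (10 ms), DicGram1 (18 ms), PDF1 (200 ms), Fax1 (800 ms) and Fax2 (400 ms), with edges labeled by the parameters D, V, S1, S2, S3, P and F.]

Fig. 1. An online document preparation system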

[Figure 2. The optimal composition result: Start, then VirC1 (Virus Check) in parallel with Thes1 and DicGram1 (Spell Check), then PDF1 (PDF Conversion), Fax2 (Fax) and End. The composition patterns are marked 1, 2 and 3.]

Fig. 2. Optimal result of the online document preparation system

TABLE 1
The Details of Each Service in Fig.1

Service  | Input                                              | Output                                               | Function                       | Response Time
---------|----------------------------------------------------|------------------------------------------------------|--------------------------------|--------------
VirC1    | document (D)                                       | no virus document (V)                                | virus check                    | 8ms/1M file
VirC2    | document (D)                                       | no virus document (V)                                | virus check                    | 20ms/1M file
Dic1     | document (D)                                       | passed dictionary test document (S1)                 | dictionary service             | 20ms/1M file
Thes1    | document (D)                                       | passed thesaurus test document (S2)                  | thesaurus service              | 15ms/1M file
Gram1    | document (D)                                       | passed grammar test document (S3)                    | grammar service                | 10ms/1M file
DicGram1 | document (D)                                       | passed dictionary and grammar test document (S1,S3)  | dictionary and grammar service | 18ms/1M file
PDF1     | passed virus and spell check document (V,S1,S2,S3) | PDF file (P)                                         | PDF conversion                 | 200ms/1M file
Fax1     | PDF file (P)                                       | fax (F)                                              | fax service                    | 800ms/1M file
Fax2     | PDF file (P)                                       | fax (F)                                              | fax service                    | 400ms/1M file

Our goal is to fulfill the above process with the lowest response time (top 1 query). With this in mind, we try to choose the proper service instance for each task.

• Virus check task: VirC1 is chosen because it has a lower response time than VirC2.

• Spell check task: There are at least two possible ways. One is to combine Dic1, Thes1 and Gram1. The other is to combine Thes1 and DicGram1. Because these services can be executed in parallel, the response time of the former is max(20, 15, 10) = 20ms. The response time of the latter is max(15, 18) = 18ms, which is better, so the latter is chosen.

• PDF task: It has only one choice, PDF1.

• Fax task: Fax2 is chosen because it has a lower response time than Fax1.

Based on the above analysis, the optimal service composition result is retrieved and presented in Fig.2. This result is a DAG (Directed Acyclic Graph) rather than a simple path or a chain of services. Its overall response time is max(8, 15, 18) + 200 + 400 = 618ms. The DAG contains three composition patterns (split, joint, and sequence) [13], [16], as shown in Fig.2.

In this example, only the top 1 service composition result is presented to users. But users may want a service composition with a slightly longer response time that satisfies other preferences which cannot be expressed by quantifiable QoS, such as brand preference and habits. For example, a user may choose VirC2 rather than VirC1 because VirC2 is his or her favorite brand. In this case, the response time of the new composition result is max(20, 15, 18) + 200 + 400 = 620ms, which is only 2ms longer than the optimal one, yet the user may prefer it to the 618ms result. This situation shows one limitation of the top 1 query.

Meanwhile, accuracy is necessary and important too. In the example of Fig.2, if we return a false top 1 result which contains Fax1 rather than Fax2, the overall response time grows from 618ms to 1018ms. For a user who cares strongly about response time, this is hard to accept, since the overall response time of the false top 1 result is about 65% worse than that of the real top 1 result.
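The response times quoted in this example can be checked with a short sketch. It assumes that parallel providers are joined with max and sequential services add up, as in the calculation above; the helper names `overall_response_time` and `make_dag` are hypothetical and not part of the paper.

```python
# Response times (ms per 1M file) taken from Table 1.
RESPONSE_MS = {"VirC1": 8, "VirC2": 20, "Thes1": 15, "DicGram1": 18,
               "PDF1": 200, "Fax1": 800, "Fax2": 400}


def overall_response_time(dag, node="End", memo=None):
    """Weight of the heaviest path from Start to `node`.

    `dag` maps each node to the list of its direct predecessors.
    Parallel predecessors are joined with max; a sequence adds up.
    """
    if memo is None:
        memo = {}
    if node == "Start":
        return 0
    if node not in memo:
        slowest_input = max(overall_response_time(dag, p, memo) for p in dag[node])
        memo[node] = slowest_input + RESPONSE_MS.get(node, 0)  # End weighs 0
    return memo[node]


def make_dag(virus_check, fax):
    """Composition of Fig.2 with a selectable virus check and fax service."""
    return {
        virus_check: ["Start"],
        "Thes1": ["Start"],
        "DicGram1": ["Start"],
        "PDF1": [virus_check, "Thes1", "DicGram1"],
        fax: ["PDF1"],
        "End": [fax],
    }


optimal = overall_response_time(make_dag("VirC1", "Fax2"))     # optimal result
brand = overall_response_time(make_dag("VirC2", "Fax2"))       # brand preference
false_top1 = overall_response_time(make_dag("VirC1", "Fax1"))  # false top 1
print(optimal, brand, false_top1)
```

Under these assumptions the three values are 618, 620 and 1018 ms, matching the figures given in the text.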

3 BACKGROUND

3.1 Terms and Definitions

Table 2 presents some terms used in this paper. The top k QoS-aware automatic service composition problem is defined as follows. In particular, when k = 1, the problem reduces to the optimal (top 1) QoS-aware automatic service composition problem.

TABLE 2
General Terms. $1 \le i \le N$, $1 \le j \le M$; N is the number of services, j is the dimension of QoS

Terms | Meaning
------|--------
QoS | Non-functional properties of a Web service, such as response time and throughput.
Request (R) | R refers to a user's individual request. $I_R$ specifies the information that this user can provide in terms of type definition and $O_R$ declares what the user needs.
Web Service ($W_i$) | $W_i = \{I_{W_i}, O_{W_i}, W_{ij}.selfQoS|_{j=1}^{M}\}$. $I_{W_i}$ represents the input parameters of $W_i$, and $O_{W_i}$ represents the output parameters of $W_i$. $W_{ij}.selfQoS$ is the jth dimension QoS criterion value of $W_i$.
Parameter Match | Two Web service parameters, $P_a$ and $P_b$, are declared as matched not only by their types but also by their ontological relationships. We use “exact” match [17] to judge the parameter match relationship. Let $Concept(P_a)$ be the concept which the parameter $P_a$ belongs to. Here, $P_a$ and $P_b$ can be matched if $Concept(P_a)$ is subClassof [17] or the same as $Concept(P_b)$. The subClassof relation is transitive. Please refer to [18] for details.
Web Service Match | Given two Web services, $W_a$ can match $W_b$ if some outputs of $W_a$ can match some inputs of $W_b$. We denote it as $O_{W_a} \cap I_{W_b} \neq \emptyset$. $W_a$ fully matches $W_b$ if $O_{W_a} \supseteq I_{W_b}$. $W_a$ partially matches $W_b$ if $(O_{W_a} \cap I_{W_b} \neq \emptyset) \wedge (O_{W_a} \nsupseteq I_{W_b})$.

Definition 3.1: Top k query for QoS-aware automatic service composition problem. Given a set of services and a request R, the set $SC_{All}$ represents all the service compositions that can satisfy R. Each service composition in $SC_{All}$ defines an invoking structure over a set of Web services $(W_1, W_2, \dots, W_N)$ that satisfies the following two conditions:

(1) $\{I_R \cup O_{W_1} \cup \dots \cup O_{W_i}\} \supseteq I_{W_{i+1}}$ $(1 \le i \le N-1)$;

(2) $\{O_{W_1} \cup O_{W_2} \cup \dots \cup O_{W_N}\} \supseteq O_R$.

We say that all $W_i$ in $SC_{All}$ are enabled when the above formulas hold. In other words, a service is enabled when all its inputs can be provided by other enabled services. $I_R$ is always enabled. The set $SC_{TopK}$ represents those service compositions whose overall QoS values (allQoS) are the top k ones among $SC_{All}$. Formally, each service composition in $SC_{TopK}$ satisfies the following condition:

(3) $\forall SC' \in SC_{TopK},\ \nexists SC \in (SC_{All} - SC_{TopK}) \wedge (SC.allQoS \succ SC'.allQoS)$ ($\succ$ means better than).

Our assumptions are: the QoS values considered here are quantitative, and the QoS values are static, retrieved from SLA files. We do not consider dynamic QoS or dynamic services in this paper. Readers interested in dynamic QoS and services may refer to our paper [19].
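The conditions of Definition 3.1 can be illustrated with a small sketch on the services of Table 1. It assumes that parameter matching is plain string equality (the ontological “exact” match is not modeled) and, slightly more strictly than condition (1), it also checks the inputs of the first service against the request inputs. The function names `satisfies_request` and `top_k` are hypothetical.

```python
import heapq


def satisfies_request(ordered_services, request_inputs, request_outputs):
    """Check conditions (1) and (2) for services listed in invocation order.

    Each service is a (name, inputs, outputs) triple of plain sets.
    """
    available = set(request_inputs)      # I_R is always enabled
    produced = set()
    for _name, inputs, outputs in ordered_services:
        if not inputs <= available:      # condition (1): every input is provided
            return False
        available |= outputs
        produced |= outputs
    return set(request_outputs) <= produced   # condition (2)


def top_k(compositions, k):
    """Condition (3) for a negative QoS such as response time: k smallest allQoS."""
    return heapq.nsmallest(k, compositions, key=lambda sc: sc[1])


# The composition of Fig.2, with the parameter letters of Table 1.
fig2 = [
    ("VirC1", {"D"}, {"V"}),
    ("Thes1", {"D"}, {"S2"}),
    ("DicGram1", {"D"}, {"S1", "S3"}),
    ("PDF1", {"V", "S1", "S2", "S3"}, {"P"}),
    ("Fax2", {"P"}, {"F"}),
]
valid = satisfies_request(fig2, {"D"}, {"F"})
# Without Thes1, PDF1 is never enabled because S2 is missing.
broken = satisfies_request([s for s in fig2 if s[0] != "Thes1"], {"D"}, {"F"})

# Overall response times (ms) of three compositions discussed in Section 2.
ranked = [("VirC1+Fax1", 1018), ("VirC1+Fax2", 618), ("VirC2+Fax2", 620)]
best_two = top_k(ranked, 2)
print(valid, broken, best_two)
```

Ranking an explicit list as in `top_k` only works when all compositions have already been enumerated, which is the costly step the KPL algorithm is designed to avoid.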


3.2 QoS Computation Rules

QoS computation rules have already been discussed in many papers [9], [10], [13], so we only give a brief introduction here. Given a service composition result, we calculate its overall QoS from its atomic constituent services. After that, we can rank different service composition results by their overall QoS. We first present four types of QoS measures:

• sum type (such as response time)
• min type (such as throughput)
• multiplication type (such as reputation)
• max type

In addition, these QoS types can be categorized into two classes [9], [10].

• Negative: the higher the value, the lower the quality, such as response time and price.
• Positive: the higher the value, the higher the quality, such as throughput and reputation.

Thus, $SC_{TopK}$ in our definition refers to those composition results with the k smallest overall QoS values when the QoS type is negative (e.g., response time).

Since the composition result is represented by a DAG³ here, we use three composition patterns, Sequence, Joint and Split, to sufficiently represent the atomic structures of a service composition result, as illustrated in Fig.1. Finally, the DAG is transformed into a BPEL (Business Process Execution Language) file, as we have done in Web Service Challenge [20].

Table 3 shows how to calculate the overall QoS of a service composition result. If we need to compute the overall value of the jth dimension QoS criterion (e.g., response time) of a sequence which contains $W_i$ $(1 \le i \le N)$, the computation rule is $W_{Nj}.allQoS = \sum (W_{ij}.selfQoS|_{i=1}^{N})$.

TABLE 3
QoS Computation Rules, $F_1, F_2 \in F$

Notations | Meaning
----------|--------
$W_{Nj}.allQoS$ | Suppose that there is a DAG which ends with $W_N$, and there are $N-1$ Web services $W_i$ $(1 \le i \le N-1)$ before $W_N$. $W_{Nj}.allQoS$ is the overall QoS value of the jth dimension QoS measure from $W_1$ to $W_N$.
$F$ | One function in $\{\sum, \min, \prod, \max, \dots\}$, chosen according to the QoS type.

Patterns | Computation Rules
---------|------------------
Sequence | $W_{Nj}.allQoS = F_1\big(W_{ij}.selfQoS|_{i=1}^{N}\big)$ $(1 \le j \le M)$
Joint | $W_{Nj}.allQoS = F_1\big(F_2(W_{ij}.selfQoS|_{i=1}^{N-1}),\ W_{Nj}.selfQoS\big)$
Split | $W_{kj}.allQoS = F_1\big(W_{0j}.allQoS,\ W_{kj}.selfQoS\big)$ $(1 \le k \le N)$. $W_0$ is the split node before $W_k$.

Take Fig.2 as an example:

$$PDF_1.allQoS = \sum\big(\max(VirC_1.selfQoS,\ Thes_1.selfQoS,\ DicGram_1.selfQoS),\ PDF_1.selfQoS\big) = \sum\big(\max(8, 15, 18),\ 200\big) = 218\,\text{ms} \tag{1}$$

$$Fax_2.allQoS = \sum\big(PDF_1.allQoS,\ Fax_2.selfQoS\big) = \sum\big(218,\ 400\big) = 618\,\text{ms} \tag{2}$$

The overall QoS value of a DAG (service composition result) is the same as that of the End node in the DAG.

3. The unfolding method [10] is used when the composition result contains a cycle.

3.3 Sim-Dijkstra Algorithm

This section presents the Sim-Dijkstra algorithm [14], which is currently the best solution for the optimal QoS-aware automatic service composition problem. Synthesizing service composition results starts by building a dependency graph⁴, as illustrated in Fig.3, where we connect two services, WA and WB, if one of WA's outputs matches one of WB's inputs. The request is treated as two special nodes, Start and End, in the graph. W1-W12 and A-M represent the services and their inputs/outputs. A candidate service composition result is essentially a connected sub-graph of this dependency graph satisfying the basic condition that the union of the input parameters of the direct successors of Start is a subset of R's input parameters ($I_R$) and the union of the output parameters is a superset of R's output parameters ($O_R$). The final solution is simply to find such sub-graphs that have the optimal (top k) overall QoS values.

4. This is a directed graph that may contain cycles.

Web Service | Input | Output
------------|-------|-------
W1  | I    | J
W2  | J    | A
W3  | I    | A
W4  | A    | B
W5  | B,M  | C
W6  | I    | D
W7  | I    | F
W8  | F    | B
W9  | G    | H
W10 | H    | D
W11 | C,D  | K
W12 | I    | M

Fig. 3. An example of Dependency Graph
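A minimal sketch of how such a dependency graph could be built from the service table of Fig.3. It assumes that plain string equality stands in for parameter matching, and it takes the request parameters $I_R$ = {I, J, G} and $O_R$ = {K} from the Start and End rows of Fig.4. The function name `build_dependency_graph` is hypothetical.

```python
from collections import defaultdict

# name: (inputs, outputs), read from the service table of Fig.3
SERVICES = {
    "W1": ({"I"}, {"J"}), "W2": ({"J"}, {"A"}), "W3": ({"I"}, {"A"}),
    "W4": ({"A"}, {"B"}), "W5": ({"B", "M"}, {"C"}), "W6": ({"I"}, {"D"}),
    "W7": ({"I"}, {"F"}), "W8": ({"F"}, {"B"}), "W9": ({"G"}, {"H"}),
    "W10": ({"H"}, {"D"}), "W11": ({"C", "D"}, {"K"}), "W12": ({"I"}, {"M"}),
}


def build_dependency_graph(services, request_inputs, request_outputs):
    """Return {service: set of direct successors}, including Start and End."""
    nodes = dict(services)
    nodes["Start"] = (set(), set(request_inputs))   # Start provides I_R
    nodes["End"] = (set(request_outputs), set())    # End consumes O_R

    # parameter -> services that take it as an input
    consumers = defaultdict(set)
    for name, (inputs, _outputs) in nodes.items():
        for param in inputs:
            consumers[param].add(name)

    # WA -> WB whenever one of WA's outputs is one of WB's inputs
    edges = defaultdict(set)
    for name, (_inputs, outputs) in nodes.items():
        for param in outputs:
            edges[name] |= consumers[param]
    return edges


graph = build_dependency_graph(SERVICES, {"I", "J", "G"}, {"K"})
print(sorted(graph["Start"]))
print(sorted(graph["W11"]))
```

Indexing the consumers by parameter first keeps the construction linear in the total number of parameters, which is the same idea as the inverted index table described in Section 3.3.1.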

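Service vertices:

Web Service | Input | Output | selfQoS | count | allQoS
------------|-------|--------|---------|-------|-------
W1          | I     | J      | 5ms     | 1     | 5ms
W2          | J     | A      | 35ms    | 1     | 35ms
W3          | I     | A      | 30ms    | 1     | 30ms
W4          | A     | B      | 10ms    | 1     | 40ms
W5          | B,M   | C      | 10ms    | 2     | 50ms
W6          | I     | D      | 20ms    | 1     | 20ms
W7          | I     | F      | 20ms    | 1     | 20ms
W8          | F     | B      | 28ms    | 1     | 48ms
W9          | G     | H      | 30ms    | 1     | 30ms
W10         | H     | D      | 40ms    | 1     | 70ms
W11         | C,D   | K      | 5ms     | 2     | 55ms
W12         | I     | M      | 20ms    | 1     | 20ms
Start(I_R)  | ---   | I,J,G  | 0ms     | 0     | 0ms
End(O_R)    | K     | ---    | 0ms     | 1     | 55ms

Reachable Precondition/Inputs Table (RPT):

input | optQoS | parents
------|--------|--------
A     | 30ms   | W3
B     | 40ms   | W4
C     | 50ms   | W5
D     | 20ms   | W6
F     | 20ms   | W7
G     | 0ms    | Start
H     | 30ms   | W9
I     | 0ms    | Start
J     | 0ms    | Start
K     | 55ms   | W11
M     | 20ms   | W12

Inverted Index Table (IIT):

input | List
------|-----
A     | W4
B     | W5
C     | W11
D     | W11
F     | W8
G     | W9
H     | W10
I     | W1,W3,W6,W7,W12
J     | W2
K     | O_R(End)
M     | W5

Fig. 4. Main Data Structure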




3.3.1 Data Structure

Each vertex $V_i$ in the graph is a tuple in our data structure, $V_i = \{W_i, I_{W_i}, O_{W_i}, selfQoS, allQoS, count\}$. $W_i$ is the unique identifier of the vertex. $I_{W_i}$ and $O_{W_i}$ represent the inputs and outputs of $W_i$ respectively. selfQoS is the QoS value of $W_i$. allQoS records the optimal overall QoS from Start to $W_i$. count is initialized to the number of inputs in $I_{W_i}$. We use a hash table, the Reachable Precondition/Inputs Table (RPT), to store the optimal QoS (optQoS) for a particular input together with its providers (parents). There may exist several services that satisfy the same input with the same optimal overall QoS. The data structures and the corresponding values for Fig.3 are presented in Fig.4.

We adopt a new approach, the Inverted Index Table (illustrated in Fig.4), to represent the graph. In the Inverted Index Table, every entry is a tuple (parameter, vertices list), which means that the inputs of each item in the vertices list contain the parameter. In this way, we can find all the incident edges tagged with a parameter in O(1) time by looking up the parameter. Apart from the above, a priority queue named enableSers is used to store the newly enabled services.

3.3.2 Details of Sim-Dijkstra Algorithm

The general idea is as follows. We determine the enabled services on the dependency graph by a forward search from the Start node and put them into the priority queue enableSers. Each time, we first handle the service with the best overall QoS in enableSers. We then obtain its enabled successors and put them into the priority queue recursively. The pseudo-code of this algorithm is presented in the Appendix at http://debs.ict.ac.cn/appendix.pdf.

In detail, the algorithm starts by adding the services enabled by Start into the priority queue. The priorities of the services in the queue are based on their allQoS values, so each time we pop the service with the best allQoS from the priority queue. For each output of the popped service, we decrease the counts of its successors by one (only when this output is provided by an enabled service for the first time). A successor service is enabled when its count becomes zero. We add the newly enabled services into the priority queue and repeat the above process until the queue is empty. The corresponding information is recorded in our data structures, as shown in Fig.4.

After running the Sim-Dijkstra algorithm, we can use the information in the RPT to generate the optimal DAGs by a backward search from the End node to the Start node. For each service, we find the optimal providers of its inputs as recorded in the RPT. The process of the backward search is shown by the double-arrow edges in Fig.3. The allQoS of the optimal service composition is 55ms.
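The forward and backward searches described above can be sketched as follows. This is a simplified illustration, not the authors' pseudo-code: it assumes response time as the only QoS criterion, keeps a single provider per parameter in the RPT (the paper allows several with equal optQoS), and uses the service values listed in Fig.4. The names `sim_dijkstra` and `optimal_dag` are hypothetical.

```python
import heapq

# name: (inputs, outputs, selfQoS in ms), values as listed in Fig.4
SERVICES = {
    "W1": ({"I"}, {"J"}, 5), "W2": ({"J"}, {"A"}, 35), "W3": ({"I"}, {"A"}, 30),
    "W4": ({"A"}, {"B"}, 10), "W5": ({"B", "M"}, {"C"}, 10),
    "W6": ({"I"}, {"D"}, 20), "W7": ({"I"}, {"F"}, 20), "W8": ({"F"}, {"B"}, 28),
    "W9": ({"G"}, {"H"}, 30), "W10": ({"H"}, {"D"}, 40),
    "W11": ({"C", "D"}, {"K"}, 5), "W12": ({"I"}, {"M"}, 20),
    "End": ({"K"}, set(), 0),
}


def sim_dijkstra(services, request_inputs):
    """Forward search; returns (allQoS per service, RPT)."""
    iit = {}  # inverted index table: parameter -> services that need it
    for name, (inputs, _outputs, _qos) in services.items():
        for param in inputs:
            iit.setdefault(param, []).append(name)
    count = {name: len(spec[0]) for name, spec in services.items()}
    all_qos = {"Start": 0}
    rpt = {}  # parameter -> (optQoS, provider)
    enable_sers = [(0, "Start")]  # priority queue ordered by allQoS
    while enable_sers:
        qos, name = heapq.heappop(enable_sers)
        outputs = request_inputs if name == "Start" else services[name][1]
        for param in outputs:
            if param in rpt:
                continue  # already provided earlier, with a QoS at least as good
            rpt[param] = (qos, name)
            for succ in iit.get(param, []):
                count[succ] -= 1
                if count[succ] == 0:
                    # Services pop in non-decreasing allQoS order, so the
                    # input that arrives last is the slowest one.
                    all_qos[succ] = qos + services[succ][2]
                    heapq.heappush(enable_sers, (all_qos[succ], succ))
    return all_qos, rpt


def optimal_dag(services, rpt):
    """Backward search from End: node -> set of its optimal providers."""
    dag, todo = {}, ["End"]
    while todo:
        name = todo.pop()
        if name == "Start" or name in dag:
            continue
        dag[name] = {rpt[param][1] for param in services[name][0]}
        todo.extend(dag[name])
    return dag


all_qos, rpt = sim_dijkstra(SERVICES, {"I", "J", "G"})
dag = optimal_dag(SERVICES, rpt)
print(all_qos["End"], sorted(dag))
```

On the data of Fig.4 this sketch reports an allQoS of 55 for End, and the backward search keeps W3, W4, W5, W6, W11 and W12, which agrees with the optimal composition discussed in the text.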

4 TOP K QOS-AWARE AUTOMATIC SERVICE COMPOSITION

4.1 Basic Idea

From previous works [11], [12], [13], we know that it is not trivial to find the (nearly) optimal service composition result, not to mention the top k results. So we look for ways to simplify the problem. Fortunately, a service composition result has a very important feature: its key path. The key path is a chain of services in a sequence which has the same overall QoS as its corresponding composition results. In this way, we can reduce the problem of finding the top k composition results (DAGs) to the problem of finding the top k paths, which is much simpler.

Briefly, our solution for the top k query consists of the following steps, which are also illustrated in Fig.5:

(1) First, we conduct a forward search on the service dependency graph from the Start node by the Sim-Dijkstra algorithm.

(2) Second, we retrieve the optimal key paths by a backward search based on the information recorded by the Sim-Dijkstra algorithm.

(3) Third, we generate DAGs from the optimal key paths. If the number of DAGs is bigger than k, we return these DAGs to the users and terminate. Otherwise, we put these key paths into a priority queue and conduct the following step recursively.

(4) Fourth, we retrieve the best key path⁵ in the current priority queue and generate its corresponding DAGs. All generated DAGs are put into one list. Once the length of the list is greater than k, we terminate. Otherwise, we loose the popped key path to obtain new, worse key paths⁶ and put these new ones into the priority queue.

What a key path is and how to loose it are explained in the following sections. To simplify the illustration of our approach, we still assume that the QoS is response time.

Fig. 5. Top K Query Solution: KPL Algorithm

4.2 Key Path and Loose Operation

Before we show how to retrieve key paths and how the loose operation works, some concepts are introduced first.

Some Concepts: The first one is the key predecessor. Formally, given a node $V_j$ in a DAG, the key predecessor of $V_j$ is the $V_i$ which satisfies the following predicate:

$$\{V_i \mid 1 \le i \le N;\ W_i.allQoS = \max(W_j.inputs.providers.allQoS)\}$$

Formally, the key path of a DAG is the chain of key predecessors, whose overall QoS value is the same as that of the DAG.

5. All the key paths in the priority queue are ranked by their overall QoS. The best key path is the one with the best overall QoS in the current priority queue. The optimal key path has the same overall QoS value as the optimal service composition, so the optimal key path is always the best key path in the first loop.

6. They are named worse key paths because they have worse overall QoS than the popped key path.


According to the above definition, the key predecessor of a service in a composition result (or DAG) is the provider with the worst overall QoS (allQoS) among all the providers of this service's inputs. Take Fig.3 as an example; its optimal service composition result is shown at the top of Fig.6. W5 in the DAG has two providers, W4 and W12, for its inputs B and M. Since W4.allQoS (40ms) is worse than W12.allQoS (20ms), W5's key predecessor is W4. The input satisfied by the key predecessor is the key input, so B is the key input in this example. Similarly, W11's key predecessor is W5. By the definition of key path, the key path of the DAG in Fig.6 is Start → W3 → W4 → W5 → W11 → End (55ms), which is shown at the bottom of Fig.6. Intuitively, the key path can be regarded as the longest or weightiest path from Start to End in the DAG.

Assume that there is another DAG (named dag) whose structure is the same as that of the DAG in Fig.6 except for service W6, which is replaced by a service W with the same input/output as W6. As long as W.selfQoS ∈ (0, 50), the key path of dag is still Start → W3 → W4 → W5 → W11 → End (55ms). If W.selfQoS > 50, the key path is different: Start → W → W11 → End (>55ms). In the same way, if we replace W6 in the DAG of Fig.6 by W9 + W10 from the dependency graph in Fig.3, the key path of the new DAG is Start → W9 → W10 → W11 → End (75ms). The process of generating a new key path with worse overall QoS is called a loose operation.

Another concept is the prefix-path. The prefix-path of a vertex is the path from Start to the current vertex.
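The key predecessor and key path concepts can be illustrated with a small sketch that works on an explicit DAG, given as a map from each node to its providers inside the composition. It assumes response time as the QoS (joint with max, sequence with sum) and uses the selfQoS values of Fig.4; the function name `key_path` is hypothetical.

```python
# selfQoS in ms, values as listed in Fig.4 (Start and End weigh 0)
SELF_QOS = {"Start": 0, "W3": 30, "W4": 10, "W5": 10, "W6": 20,
            "W9": 30, "W10": 40, "W11": 5, "W12": 20, "End": 0}


def key_path(dag):
    """Return (key path, overall QoS) of a composition result.

    `dag` maps each node to the list of its direct providers in this DAG.
    """
    all_qos = {}

    def qos(node):
        if node not in all_qos:
            worst = max((qos(p) for p in dag.get(node, [])), default=0)
            all_qos[node] = worst + SELF_QOS[node]
        return all_qos[node]

    path = ["End"]
    while dag.get(path[0]):
        # key predecessor: the provider with the worst (largest) allQoS
        path.insert(0, max(dag[path[0]], key=qos))
    return path, qos("End")


# Optimal composition result of Fig.3 (top of Fig.6).
optimal = {"End": ["W11"], "W11": ["W5", "W6"], "W5": ["W4", "W12"],
           "W4": ["W3"], "W3": ["Start"], "W6": ["Start"], "W12": ["Start"]}
# Same DAG with W6 replaced by W9 + W10.
replaced = {"End": ["W11"], "W11": ["W5", "W10"], "W5": ["W4", "W12"],
            "W4": ["W3"], "W3": ["Start"], "W12": ["Start"],
            "W10": ["W9"], "W9": ["Start"]}

optimal_path, optimal_qos = key_path(optimal)
replaced_path, replaced_qos = key_path(replaced)
print(" -> ".join(optimal_path), optimal_qos)
print(" -> ".join(replaced_path), replaced_qos)
```

When several providers tie for the worst allQoS, this sketch keeps only the first one, whereas a DAG may in general have several key paths with the same overall QoS.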

key predecessor for the head vertex of current uncompleted key path backward and recursively. This process starts from End node and stops until we reach Start node. Finally, these key predecessors construct the optimal key path. The optimal key path in Fig.6 is retrieved in this way by the information stored in RPT ( refer to Fig.4 ). With regard to the non-optimal key paths, they are retrieved by loosing available key paths recursively. We will discuss it soon. Key Path Properties: (1)The overall QoS value of key path is the same as the overall QoS value of the corresponding DAG. (2)A DAG may have one or several key paths whose overall QoS values are the same. (3)A key path corresponds to one or several DAGs. In other words, different DAGs may have the same key path. Loose Operation: Loose operation is an operation on key path, which is to change the key predecessor of a vertex in current key path to obtain a new key path (with worse overall QoS value). This vertex is called loose vertex. Note that: (a) overall QoS of the new key path is worse than that of current key path. (b) for the current key path, its loose operations are conducted in a backward order. Thus, End is the first loose vertex. The direct successor of Start is the last loose vertex. (c) for each loose operation, the new key predecessor of the loose vertex is chosen from the providers of the inputs of current loose vertex from the service dependency graph. Furthermore, this new predecessor must be the one with the smallest response time (e.g., the QoS is response time) among all providers whose overall QoS values are worse than the original key preprocessor. Formally, given a loose node Vi in a DAG, its new key predecessor is Vk which satisfies the following predicate: {k| 1 ≤ k ≤ N ; Wk .allQoS = min( p.allQoS | p ∈ Wi .inputs.providers; p.allQoS > Wi .keyP recedessor.allQoS) }

Fig. 6. Optimal Service Composition and Its Key Path of Fig.3 Key Path Retrieval: Similar to DAGs, key paths are ranked by their overall QoS too. Due to the overall QoS values of key pahts are the same as their corresponding DAGs, if we retrieve the top k key paths, we can generate the top k DAGs with the guide of these top k key paths then. The key paths corresponding to the optimal candidate DAGs are called “optimal key paths”. The key paths corresponding to nonoptimal DAGs are called “non-optimal key paths”. In this way, all key paths are divided into two categories and can be generated by two different approaches respectively. • Optimal Key Paths: generated by RPT. • Non-optimal Key Paths: generated by loosing existing key paths. After conducting Sim-Dijkstra algorithm, we can obtain the optimal key paths through the information recorded in RPT. The concrete process is: starting from End, we retrieve the

The prefix-path of this new predecessor is obtained in the same way as the backward search used for result generation. In our algorithm, every vertex in the key path is loosed to obtain new key paths. A concrete example of the loose operation is presented as follows.

Example 4.1: Take Fig.3 as an example again. The key path of the optimal DAG is KP1, which is presented in Table 4 and at the bottom of Fig.6. We describe the details of conducting loose operations on KP1 here. In backward order, the loose vertices of KP1 are End, W11, W5, W4, W3. The details are as follows.
• Loose End: Since End has no providers other than W11, as shown in the dependency graph, no new key path is generated.
• Loose W11: Currently, W11.allQoS is 55ms and its key predecessor is W5. Its providers are W5, W6 and W10, whose allQoS values are 50ms, 20ms and 70ms respectively. Since the new key path must have a worse overall QoS, we replace the key predecessor W5 by W10. In this way, a new key path, KP2 (75ms), is generated.
• Loose W5: In a similar way, W8 becomes the new key predecessor of W5, and a new key path, KP3 (63ms), is obtained.
• Loose W4: Its new key predecessor is W2, and a new key path, KP4 (60ms), is generated.


TABLE 4
Key Paths

Name | Key Path (allQoS)
KP1 (Optimal Key Path) | Start → W3 → W4 → W5 → W11 → End (55ms)
KP2 | Start → W9 → W10 → W11 → End (75ms)
KP3 | Start → W7 → W8 → W5 → W11 → End (63ms)
KP4 | Start → W2 → W4 → W5 → W11 → End (60ms)
KP5 | Start → W1 → W2 → W4 → W5 → W11 → End (65ms)

• Loose W3: No new key path is generated, because W3 has only one provider.
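The rule for choosing a new key predecessor can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the QSynth implementation: the function name `new_key_predecessor` and the dictionary layout are hypothetical, and the allQoS values are the ones quoted for the providers of W11 in Example 4.1.

```python
def new_key_predecessor(providers, current_key_pred):
    """Pick the new key predecessor for a loose vertex.

    providers: hypothetical mapping {provider name: allQoS in ms} of all
    providers of the loose vertex's inputs.
    current_key_pred: name of the current key predecessor.
    Returns the provider with the smallest allQoS that is strictly worse
    (larger) than the current key predecessor's allQoS, or None.
    """
    threshold = providers[current_key_pred]
    worse = {name: qos for name, qos in providers.items() if qos > threshold}
    if not worse:
        return None  # no new key path can be generated from this vertex
    return min(worse, key=worse.get)


# Providers of W11 as quoted in Example 4.1; W5 is the current key predecessor.
w11_providers = {"W5": 50, "W6": 20, "W10": 70}
print(new_key_predecessor(w11_providers, "W5"))
```

With the values of Example 4.1 the sketch selects W10, and it returns `None` for a vertex that has a single provider, which corresponds to the "no new key path" cases.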

By loose operations on KP1, we obtain three new key paths, KP2, KP3 and KP4, which are presented in Table 4. All of these new key paths have a worse allQoS than key path KP1.

4.3 Key-Path-Based Loose (KPL) Algorithm

Now, we present the pseudo-code of the Key-Path-Based Loose algorithm in Algorithm 1. All the details of the KPL algorithm have been discussed above, except how to generate a DAG under the guidance of a key path.

Algorithm 1: Key-Path-Based Loose (KPL) Algorithm
Input: G = (V, E); source vertex: Start; Reachable Preconditions Table: RPT; priority queue: PQkp
Output: Top-k service composition results
1  Execute Sim-Dijkstra algorithm;
2  Obtain optimal key paths and output all optimal DAGs generated by them;
3  if |generated DAGs| ≥ k then return;
4  Push optimal key paths into PQkp;
5  while PQkp ≠ ∅ do
6      Key path KP ⇐ PQkp.pop();
7      Generate and output all the DAGs by KP if KP is not an optimal key path;
8      if |generated DAGs| ≥ k then return;
9      Loose(KP) to generate new key paths and insert them into PQkp. Make PQkp.size() ≤ k − |generated DAGs|;
10 end

Generation of DAG By Key Path: This generation process is similar to the optimal DAG generation in [13]. A backward search is conducted from the End node under the guidance of the key path. The provider of each key input is obtained from the key path, and the provider of a non-key input can be any provider that does not change this key path.

Example 4.2: We again take Fig.3 as an example to show how to retrieve the top k (k=3) results with our KPL algorithm. As shown in Step 1 of Table 5, we extract the optimal key path, KP1 (55ms), after conducting the Sim-Dijkstra algorithm. After that, only one service composition result can be generated by KP1. This optimal result is the one at the top of Fig.6. Moreover, this key path is inserted into the priority queue (lines 3-4 of Algorithm 1). Then, we execute lines 5-9 of Algorithm 1: KP1 is popped from the priority queue because it is the only item in it. As shown in Steps 5-9 of Table 5, we then loose KP1 backward and obtain several new key paths. The concrete loose operations are presented in Example 4.1.

In the next loop, KP4 (60ms) has the best allQoS, so it is popped and the corresponding DAG is generated. In the same way, we loose it backward; the only loose vertex is W2 (see footnote 7), so KP5 (65ms) is generated. However, KP5 is not pushed into the priority queue, because its allQoS is larger than KP3.allQoS (63ms) and only one key path is needed now, since k − |generated DAGs| = 1. In the next loop, KP3 is popped and the third DAG is obtained. In this way, we fulfill the top 3 query.

Furthermore, the KPL algorithm has two important and useful properties:
• Progressive: The algorithm can output the top k service compositions one by one, which is useful for time-sensitive queries. The user does not need to wait until all top k results are generated. What is more, users do not even need to specify the value of k, because they can stop accepting new service composition results once their preferred ones have been output.
• Incremental: Existing key paths can be reused to generate new key paths, which avoids calculating them from scratch.

The time complexity is O(n log n + m log m + k log k + km + kn). Three properties of the KPL algorithm, terminability, optimality and completeness, are also proved. Please refer to the Appendix for more details.
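The main loop of Algorithm 1 can be illustrated with a small Python sketch built on a binary heap. It is a simplification under explicit assumptions, not the authors' implementation: the names `ALL_QOS`, `LOOSE` and `top_k_key_paths` are hypothetical, the results of each loose operation are hard-coded from Example 4.1 and Table 4 instead of being computed from a dependency graph, and every key path is assumed to guide exactly one DAG.

```python
import heapq

# Hypothetical data copied from Table 4 and Example 4.1.
ALL_QOS = {"KP1": 55, "KP2": 75, "KP3": 63, "KP4": 60, "KP5": 65}
LOOSE = {"KP1": ["KP2", "KP3", "KP4"], "KP4": ["KP5"]}  # key paths obtained by loosing


def top_k_key_paths(optimal, k):
    """Yield key paths in the order their DAGs would be output (progressive)."""
    generated = 0
    heap = []
    for kp in optimal:  # lines 2-4: optimal key paths come first
        yield kp
        generated += 1
        if generated >= k:
            return
        heapq.heappush(heap, (ALL_QOS[kp], kp))
    while heap:  # lines 5-10
        _, kp = heapq.heappop(heap)
        if kp not in optimal:
            yield kp
            generated += 1
            if generated >= k:
                return
        for new_kp in LOOSE.get(kp, []):
            heapq.heappush(heap, (ALL_QOS[new_kp], new_kp))
        # Keep at most k - |generated DAGs| candidates (line 9).
        heap = heapq.nsmallest(k - generated, heap)
        heapq.heapify(heap)


print(list(top_k_key_paths(["KP1"], 3)))
```

Because the function is a generator, a caller can stop consuming results at any time, which mirrors the progressive property; the trimming step reproduces why KP2 and KP5 are dropped from the queue in Example 4.2.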

5 EVALUATION

In this section, we compare the KPL algorithm with state-of-the-art solutions. The KPL algorithm is implemented and integrated in our tool, QSynth, which won the performance championship in Web Service Challenge 2009 and 2010 [20]. In our evaluation, the generated workload is based on the public Web Service Challenge test set generator [20] and real Web services’ QoS values from [21]. In order to demonstrate the scalability and efficiency of our algorithm under different and even extreme conditions, we generate three groups of data sets. Each group contains six different test sets obtained by varying one of three parameters: the number of concepts, the number of services and the solution depth. Here, each Web service has a corresponding response time (in ms). The three groups of Web services are shown in Table 6. All the experiments are carried out on a 2.4GHz machine with 4 GB RAM running Windows 7.

5.1 Query Time

To the best of our knowledge, the WSCBT* algorithm [15] is currently the only existing work on top k query for the QoS-aware service composition problem. It is based on a heuristic path search and constructs composition results by combining several shortest paths. Unless stated otherwise, the following experiments retrieve the top 500 DAGs (K=500). In this section, we compare the query time of the KPL algorithm and the WSCBT* algorithm in three cases: different depths, different numbers of concepts and different numbers of services. Results are shown in Fig.7(a,b,c). The query time of our KPL algorithm is less than 300ms and is always smaller than that of the WSCBT* algorithm. This is due to the scalability limitation of WSCBT*, because a large number of combinations can enable each service in the test sets. Note that the query time depends on the complexity of the dependency graph, which is decided by service number, concept number and depth together, so there may be no obvious trend when only one of them grows.

Besides, we also conduct experiments on these test sets by varying K. Fig.7(d) shows the case of #concept=16000, #service=4000 and #depth=10. The other cases are similar and are skipped here. From the result, we can see that query time grows gradually as k increases, because more DAGs need to be searched and generated. Nevertheless, the query time of the KPL algorithm is still much smaller than that of the WSCBT* algorithm.

Footnote 7: Because W11 and W5 have been loosed in KP1, W4 is the last loose vertex on KP4. Thus, W2 is the loose vertex in KP4.

TABLE 5
An Example of KPL Algorithm (Top 3 Results)
(A "-" marks a cell that is empty or not recoverable from the source.)

Step | Operation | Loose Vertex | Key Paths In Priority Queue | Generated DAG Size | Note
Step 1 | Execute Sim-Dijkstra and extract the optimal key path | - | - | 0 | -
Step 2 | Generate DAG by KP1 | - | - | 1 (By KP1) | The first DAG is on the top of Fig.3
Step 3 | Insert KP1 into PQkp | - | KP1 (55ms) | 1 | -
Step 4 | Pop PQkp | - | ∅ | 1 | -
Step 5 | Loose KP1 | End | - | 1 | No new key path
Step 6 | Loose KP1 | W11 | KP2 (75ms) | 1 | New key path KP2
Step 7 | Loose KP1 | W5 | KP3 (63ms), KP2 (75ms) | 1 | New key path KP3
Step 8 | Loose KP1 | W4 | KP4 (60ms), KP3 (63ms) | 1 | New key path KP4
Step 9 | Loose KP1 | W3 | KP4 (60ms), KP3 (63ms) | 1 | No new key path. Loose KP1 is finished
Step 10 | Pop KP4 and generate DAG by it | - | KP3 (63ms) | 2 (By KP4) | The second DAG
Step 11 | Loose KP4 | W2 | KP3 (63ms) | 2 | Keep PQ size to 1, KP5’s allQoS is 65ms
Step 12 | Pop KP3 and generate DAG by it | - | - | 3 (By KP3) | The third DAG

TABLE 6
Three Groups of Test Sets

Group | Concept Number | Service Number | Solution Depth
Different Depths | 8000 | 4000 | 4,6,8,12,14,16
Different Number of Services | 16000 | 1000,2000,4000,6000,8000,10000 | 10
Different Number of Concepts | 2000,5000,10000,15000,20000,25000 | 4000 | 10

5.2 Accuracy

This experiment compares the accuracy of both algorithms. The ratio of the average allQoS of the top k results is adopted to measure it:

$$ratio = \frac{\frac{1}{k}\sum_{i=1}^{k} \left(\text{top } i \text{ DAG QoS value by WSCBT*}\right)}{\frac{1}{k}\sum_{i=1}^{k} \left(\text{top } i \text{ DAG QoS value by KPL}\right)}$$

That is, the ratio equals the average overall QoS value of WSCBT*’s top k composition results divided by that of our KPL algorithm. As shown in Fig.8, the ratio values are always larger than 1, which means that the top k results of the KPL algorithm have smaller, i.e. better, response times. This is because of the heuristic rule used in WSCBT*, which leads to false top k composition results. This experiment only shows that the accuracy of the KPL algorithm is better. We present the definition and proof of 100% accuracy in the Appendix.

Fig. 8. Accuracy

Fig. 7. The query time of different cases (panels a-d)

5.3 Properties of Key Paths

We also conduct experiments to evaluate the following properties of key paths:
• Length: the number of services contained in a key path, excluding the Start and End nodes.
• Overall QoS: the overall response time of a key path.
• New key paths: the number of key paths newly generated by each loose operation.

These experiments are also conducted on the test sets in Table 6. Because of the page limitation, Fig.9(a)(b)(c) only show the case of #concept=8000, #service=5000, #depth=4. In this case, only the key paths ranked 1-9 are needed to guide the generation of the top 500 DAGs. Their ranks are the same as the order in which they are popped from PQkp. Besides, their lengths range from 2 to 4, because key paths with smaller length tend to have better overall QoS. As their ranks increase, their overall QoS values grow too. Moreover, Fig.9(c) shows that the number of key paths newly generated by a loose operation ranges from 1 to 3. These details help explain the good performance of the KPL algorithm. In most cases, one key path can guide the generation of several DAGs with the same overall QoS. Meanwhile, the top k key paths tend to be very short, which reduces the running time of key path loose operations and DAG generation.

Fig. 9. Properties of key paths (panels a-c)

5.4 Real QoS

In order to evaluate the KPL algorithm in real scenarios, we also conduct experiments on the QoS values of 2000 real Web services from [21]. We replace the synthetic response times of the generated Web services with these real QoS values (see footnote 8). The results are presented in the Appendix, and their conclusions are similar to those of the above experiments. All the service compositions generated in those experiments were checked by our verification program and satisfy the request. Recently, a competition called China Web Service Cup was held to solicit algorithms and software for top k query of the QoS-aware automatic service composition problem. The details can be accessed from [22]. The result of this competition is presented in the Appendix, which also shows that our solution has the best performance and accuracy.

Footnote 8: We do not use real Web services because real Web services still lack semantic information. Different services may describe the same thing with various words, so it is not trivial to match Web services accurately without semantic information.

6 MULTIPLE QOS

Until now, each service has been assumed to have only one dimension of QoS value, e.g., response time. In fact, our algorithm can be extended to handle the case where each service has multiple QoS measurements. One possible approach is to use a Multiple Attribute Decision Making approach, i.e., simple additive weighting [23], to transform all the QoS values into an aggregate QoS score before applying our algorithm. But this simplifies the QoS requirements too much. Our approach uses the technique adopted in the research field of multidimensional data sets [24], [25].

Concretely, our approach for multiple QoS is as follows. The user first chooses the QoS dimension that is most important to him or her, e.g., response time. The returned service composition results cannot violate the threshold of this QoS requirement, e.g., the overall response time should be less than 60s. Thanks to the progressive property of the KPL algorithm, we can satisfy the threshold easily by no longer loosing key paths once their overall QoS is worse than that threshold. In this way, only the DAGs that satisfy both the request and the threshold are returned. After that, we calculate the multidimensional overall QoS values of each returned DAG. Finally, we rank these returned DAGs by their multiple QoS values with the FA [24] or TA [25] approaches, which are adopted in the research of ranking on multidimensional data sets. Only the top k ones are provided to the user. The details are as follows.

Assuming there are n QoS measurements, the ranking mechanism works in the following way if we adopt the FA algorithm:

Step 1: Conduct the KPL algorithm on one QoS measurement with a threshold specified by the user. For each returned DAG, which must not be worse than the threshold, we calculate its scores for the multiple QoS measurements.

Step 2: For each QoS measurement, we build a sorted list in ascending order (assuming that a smaller QoS value is better; if not, we can take its reciprocal as the QoS value).

Step 3: For each sorted list, we normalize each score X in the list. The normalized score is

$$\begin{cases} \dfrac{score(X) - Min}{Max - Min}, & Max \neq Min \qquad (3) \\[2ex] 1, & Max = Min \qquad (3') \end{cases}$$

Step 4: Do sorted access in parallel (see footnote 9) to these n sorted lists until there are k “matches”. In other words, access the lists in parallel until there are k DAGs each of which has been accessed in all n sorted lists.

Step 5: For each DAG that has been accessed, compute the aggregation score Agg(DAG) = f(score_1, score_2, ..., score_n). For example, f can be a weighted sum function with w_1 + w_2 + ... + w_n = 1:

$$f(score_1, score_2, \ldots, score_n) = w_1 \cdot score_1 + w_2 \cdot score_2 + \cdots + w_n \cdot score_n \qquad (4)$$

Step 6: Let R be the set containing the k accessed DAGs with the smallest aggregation scores. We recommend R to the users.
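Steps 2 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `normalize` and `fa_top_k`, the DAG names D1 to D3, the "cost" measurement and all numeric values are hypothetical, and every DAG's score is assumed to be available for every measurement so that the random-access phase of FA reduces to a dictionary lookup.

```python
def normalize(scores):
    """Step 3: min-max normalization; every score becomes 1 when Max = Min."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {dag: 1.0 for dag in scores}
    return {dag: (s - lo) / (hi - lo) for dag, s in scores.items()}


def fa_top_k(qos, weights, k):
    """qos: {measurement: {dag: raw score}}, smaller is better."""
    norm = {m: normalize(s) for m, s in qos.items()}
    # Step 2: one ascending sorted list per QoS measurement.
    lists = {m: sorted(s, key=s.get) for m, s in norm.items()}
    n = len(lists)
    seen = {}  # dag -> set of measurements whose list has reached it
    depth = 0
    max_depth = max(len(lst) for lst in lists.values())
    # Step 4: sorted access in parallel until k DAGs were seen in all n lists.
    while depth < max_depth and sum(len(v) == n for v in seen.values()) < k:
        for m, lst in lists.items():
            if depth < len(lst):
                seen.setdefault(lst[depth], set()).add(m)
        depth += 1
    # Step 5: weighted-sum aggregation for every DAG that has been accessed.
    agg = {d: sum(weights[m] * norm[m][d] for m in norm) for d in seen}
    # Step 6: the k accessed DAGs with the smallest aggregation scores.
    return sorted(agg, key=lambda d: (agg[d], d))[:k]


qos = {
    "response_time": {"D1": 55, "D2": 60, "D3": 63},
    "cost": {"D1": 9, "D2": 3, "D3": 5},
}
print(fa_top_k(qos, {"response_time": 0.5, "cost": 0.5}, k=2))
```

In this toy input D1 has the best response time but the worst cost, so the weighted sum ranks D2 first; changing the weights changes the ranking, which is the point of keeping the dimensions separate until the final aggregation.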

7 DISCUSSION

If we look closely at top k query for the QoS-aware automatic service composition problem from a graph view, it can be generalized as a new graph problem: how to search for the top k DAGs in the dependency graph. We call it the “top k DAGs problem” for short. Like other graph problems, e.g., the shortest path problem, the top k DAGs problem is useful in many other fields, such as API recommender and supply chain.

7.1 Formal Description of Top K DAGs Problem

We are given a weighted, directed dependency graph G = (V, E), where V and E are the vertices (nodes) and edges of G respectively. In this dependency graph, a new concept, tag, is introduced. A tag is attached to each edge in the graph, so this dependency graph is also called a tagged graph. Related concepts of the tagged graph are presented as follows.
• Vertex: $V_i$ ($1 \le i \le n$) is a vertex in a tagged graph, where i is the identification (id) of $V_i$ and n is the number of vertices in the graph.
• Edge: A directed edge of G is $E_{i,tag,k} = (V_i, V_k, tag)$ ($1 \le i, k \le n$). This edge goes from vertex $V_i$ to vertex $V_k$ with the identifier tag. $V_i$ and $V_k$ are the head and the tail of the edge. Meanwhile, $V_i$ is the direct predecessor of $V_k$ and $V_k$ is the direct successor of $V_i$.
• $IN_i = \bigcup \{tag \mid E_{j,tag,i} \in E\}$, $(1 \le j \le n) \land (i \ne j)$. It is the union set of all the tags of the edges whose tails are $V_i$. $indegree_i$ is the number of its incoming edges. $intagdegree_i$ is the number of distinct tags of its incoming edges, which is equal to $|IN_i|$. The following always holds: $intagdegree_i \le indegree_i$.
• $OUT_i = \bigcup \{tag \mid E_{i,tag,k} \in E\}$, $(1 \le k \le n) \land (i \ne k)$. It is the union set of all the tags of the edges whose heads are $V_i$. $outdegree_i$ is the number of its outgoing edges. $outtagdegree_i$ is the number of distinct tags of its outgoing edges, which is equal to $|OUT_i|$. Similarly, $outtagdegree_i \le outdegree_i$.

$IN_i$ also represents all the preconditions of vertex $V_i$. $OUT_i$ represents all the effects that $V_i$ can produce, or the preconditions that it can satisfy. Tag, precondition and effect mean the same thing if we do not distinguish them strictly.

Definition 7.1: (Reachable). In a search that begins at a given source vertex, for any vertex $V_i$, we say $V_i$ is reachable from the source vertex if and only if all its preconditions are satisfied.

Note that the reachable property here is different from that of a classical graph, where a vertex is reachable whenever there exists a path from the source vertex to it. In essence, a vertex in a classical directed graph has only one precondition in our model.

Definition 7.2: (Top K DAGs). In the tagged graph, given a query that is treated as two virtual nodes, $V_s$ and $V_e$, a set of subgraphs forms the top k DAGs if and only if they satisfy:
• These DAGs begin with $V_s$ and end with $V_e$;
• The preconditions of all vertices in each DAG, except the source vertex $V_s$, are satisfied, i.e., the vertices are reachable;
• There is no redundant vertex in these DAGs. That is, if one vertex is removed from a DAG, there must exist some vertices that are no longer reachable from $V_s$;
• The overall weights of these DAGs are the top k ones among all the candidate DAGs that satisfy the above three conditions.

Intuitively, the k shortest paths problem is similar to our problem to some degree. However, there are several major differences.
• Our problem is to retrieve DAGs, while the k shortest paths problem is to find paths. In fact, a path is only a special case of a DAG, namely a DAG that contains only the sequence composition pattern.
• Our graph is different from a classic graph. A node in our graph has several preconditions, and we can search the successors of the node only when all its preconditions are fulfilled. In a classic graph problem, we can search the successors of a node whenever there is a path that reaches it.
• There are different composition patterns in our graph, such as split, join and sequence, so the computing rule of the overall QoS depends on both the QoS type and the composition patterns. The computing rule in a classic graph only considers the sequence pattern.

In fact, when every node in our dependency graph has only one input and one output, our problem is reduced to the k shortest paths problem. In other words, the k shortest paths problem is a special case of the top k DAGs problem.

Footnote 9 (to Step 4 of Section 6): The top member of each list is accessed first, then the second member of each list, and so on.
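The difference between reachability in Definition 7.1 and ordinary path reachability can be illustrated with a short forward-chaining sketch in Python. The function name `reachable_vertices`, the vertex names V1 to V4 and the tags a to e and x are hypothetical and chosen only for illustration; each vertex is described by its IN set (preconditions) and its OUT set (effects).

```python
def reachable_vertices(vertices, source_tags):
    """Return the vertices whose preconditions can all be satisfied.

    vertices: {name: (IN set, OUT set)}; source_tags: tags provided by the
    source vertex. A vertex becomes reachable only when every tag in its IN
    set is available, not merely when some edge leads to it.
    """
    available = set(source_tags)
    reached = set()
    changed = True
    while changed:
        changed = False
        for name, (needs, gives) in vertices.items():
            if name not in reached and needs <= available:
                reached.add(name)
                available |= gives  # effects become usable preconditions
                changed = True
    return reached


graph = {
    "V1": ({"a"}, {"b"}),
    "V2": ({"a"}, {"c"}),
    "V3": ({"b", "c"}, {"d"}),   # needs both V1 and V2
    "V4": ({"d", "x"}, {"e"}),   # tag x is never produced
}
print(sorted(reachable_vertices(graph, {"a"})))
```

Here V4 receives tag d from V3, so a path from the source to V4 exists in the classical sense, yet V4 is not reachable under Definition 7.1 because its precondition x is never satisfied.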

Given a problem from another field, if we can build a tagged graph for it and transform the problem into the top k DAGs problem, we can then use the KPL algorithm to address it, as we have done for top k query of QoS-aware automatic service composition. In fact, plenty of problems in different research fields can be transformed into this new graph problem, such as supply chain and API recommender. This suggests that the top k DAGs problem is widely applicable.

7.2 Potential Application: API Recommender

In the field of software engineering, developers are always looking for effective approaches to build software applications. Among these approaches, reusing existing libraries and frameworks is an effective and practical one. However, because libraries have large APIs and are flexible, reusing a library is a non-trivial task [26]. One challenging problem involved is object instantiation. Specifically, given a query in the form of a pair of objects or class types (Source, Destination), the task is to return the method invoking sequences from the Source type to the Destination type. Although this is a common problem, it is still not easy whenever we face a new library: reading long and out-of-date documents to find what we need is a difficult and tedious task if we do not have any hints. If we can fulfill this object instantiation process automatically and recommend the related method invoking sequences to users, developers’ workload can be reduced considerably.

Let us first present a simple example to illustrate the process of object instantiation. As shown in Fig.10, we want to instantiate class type E from S. With S in hand, we can create an instance of A by invoking methods M1 and M2. Meanwhile, we can create an instance of C by invoking method M3. In this way, the invoking sequences that contain methods M1, M2, M3, M4 and M5 can be used to fulfill the task of instantiating class type E from S. This process of retrieving the related methods and their order from all the APIs of existing libraries and frameworks is called API recommender.

In order to accomplish this process automatically, we adopt the following steps. At first, we construct a graph by retrieving all the API signatures of the framework, where each node corresponds to a method. Intuitively, the method name is taken as the Id of the corresponding node. All the parameter types of this method are taken as the inputs of the corresponding node. The return type of this method is taken as the output of the node.
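The graph construction described above can be sketched in Python using the five method signatures listed in Fig.10. This is an illustrative toy, not the tool mentioned in the paper: the names `METHODS` and `methods_to_instantiate` are hypothetical, each type is assumed to have exactly one producing method, and QoS and ranking are ignored, so the sketch only performs the backward search for a single DAG.

```python
# Each node is a method: name -> (input parameter types, return type),
# following the signatures listed in Fig. 10.
METHODS = {
    "M1": (["S"], "B"),
    "M2": (["B"], "A"),
    "M3": (["S"], "C"),
    "M4": (["A", "C"], "D"),
    "M5": (["D"], "E"),
}


def methods_to_instantiate(target, source, methods):
    """Backward search from the target type; returns methods in a valid call order."""
    # Simplifying assumption: exactly one method produces each type.
    producer = {ret: name for name, (_, ret) in methods.items()}
    order = []

    def visit(type_name):
        if type_name == source:
            return  # the source object is already in hand
        name = producer[type_name]
        if name in order:
            return  # this method is already scheduled
        for param in methods[name][0]:
            visit(param)  # all inputs must be available before the call
        order.append(name)

    visit(target)
    return order


print(methods_to_instantiate("E", "S", METHODS))
```

The join at M4, which needs both A and C, is what makes the result a DAG instead of a single path; with several producers per type and a QoS value per method, choosing among the candidate DAGs becomes the top k DAGs problem.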
After that, we assign QoS to the nodes in the above graph. Concretely, we analyze the source code of other projects in a code repository to retrieve the usage history of these APIs, for example, how many times developers use each API. The use frequency of each API is then assigned as the QoS of the corresponding node. Finally, our problem is simply to find sub-graphs like the one in Fig.10 with an overall QoS that is as good as possible. In this way, API recommender is also transformed into the top k DAGs problem.

According to our survey, existing works [26], [27], [28] address the API recommender problem with approaches based on the shortest path, so they have some limitations in efficiency and accuracy [14]. This is because API recommender is in essence a top k DAGs problem rather than a shortest path problem. With regard to efficiency, existing approaches have to generate DAGs by combining several paths first and then rank them, while our approach retrieves ranked key paths first and then generates DAGs only if necessary. With regard to accuracy, existing path-based approaches may result in a “false” overall QoS. We have implemented a tool that fulfills API recommender based on the KPL algorithm with some changes. Our experiment shows that our solution takes less time and has higher accuracy. The result is presented in the Appendix. We also present another potential application, supply chain, in the Appendix.

Return type | Method name | Input parameters
B | M1 | S
A | M2 | B
C | M3 | S
D | M4 | A, C
E | M5 | D

Fig. 10. An example of object instantiation

8 RELATED WORK

8.1 QoS-Aware Automatic Service Composition

In the SOC paradigm, since it is usually impossible to find composition results manually among a huge number of services, automatic service composition has been proposed, which aims to enable the automatic search of service compositions for given requests. There are mainly two categories of approaches: AI planning [2], [29], [3], [30] and graph search [5], [31], [32]. In addition to these centralized algorithms, a distributed approach is put forward in [33], which improves system performance by utilizing distributed computing resources. These solutions for automatic service composition do not consider the non-functional attributes of services. They usually return the service composition results with fewer services or less depth.

On the other hand, to guarantee local or global QoS requirements of a service composition, the service selection problem has attracted a lot of attention [10], [9], [34], where Integer Programming, the multichoice 0-1 knapsack problem, the multiconstraint optimal path problem and genetic algorithms are adopted. Moreover, the authors in [35] propose a community-centric approach for service selection. The research in [36] focuses on the routing aspect of the QoS-aware service selection problem, so its goal is to achieve efficient network usage while guaranteeing the QoS of services, which is similar to the network configuration problem. In general, these approaches for service selection usually assume the existence of a predefined abstract process or template, a set of “abstract” tasks or service classes, and service instances with the same functionality for each task. This assumption may not hold in a real service environment.

Recently, a combined problem of automatic service composition and service selection, QoS-aware automatic service composition, has attracted a lot of attention [13], [11], [12]. It aims to automatically retrieve the optimal or near-optimal service composition that satisfies the user’s request. But these works only return the optimal result, which may cause several problems, such as hot services, as we have discussed in Section 1. With regard to top k query of the QoS-aware automatic service composition problem, only Wang’06 [15] proposes a heuristic approach, WSCBT*. However, it gives no guarantee of its precision. Moreover, that work does not strictly distinguish different composition patterns and QoS types, which may sometimes make its aggregated QoS computation rule inaccurate.

8.2 K shortest paths problem

The k shortest paths problem is very similar to our problem. It is to find the k paths with the smallest length between a single pair of nodes in a graph. Quite a few dynamic programming problems can be mapped to it, e.g., the knapsack problem, sequence alignment in molecular biology and length-limited Huffman coding. The author in [37] proposes an algorithm that outputs an implicit representation of the k shortest paths in O(m + n log n + k) time (see footnote 10). The research in [38] solves the k shortest simple (see footnote 11) paths problem in a directed graph. The research in [39] goes further: it solves the problem of finding the k shortest paths in a digraph without violating several constraints among node pairs. A heuristic algorithm, A* prune, is proposed to deal with this problem. But the k shortest paths problem is only a special case of our top k DAGs problem. The differences between them are described in Section 7.

8.3 Top k query of multidimensional data sets

The well-known approaches for this problem are FA [24] and TA [25]. Moreover, the paper [40] targets approximate top k queries and presents a family of approximate top k algorithms with probabilistic guarantees. But these works focus on retrieving the top k objects or nodes in a multidimensional data set. In short, these approaches can be used to rank nodes or DAGs with multiple QoS values, but they cannot search the dependency graph for good-quality DAGs that satisfy the request.

8.4 Subgraph Matching/Query

Given two graphs A and B, the subgraph isomorphism problem [41] is to judge whether A contains a subgraph that is isomorphic to B. It is known to be NP-complete. This problem is also the basis of many applications, such as pattern recognition and social communities. Subgraph matching can be classified into two classes: 1) given a graph database, the goal is to find all the graphs in the database that are subgraph isomorphic to the query graph [42], [43]; 2) given a large graph G, the goal is to find all subgraphs of G that are isomorphic to the query graph [44], [45]. Furthermore, the problem can be divided into deterministic subgraph matching and subgraph matching on uncertain graphs [46], because many graphs or networks are uncertain or noisy. Paper [47] shows how to find the top k subgraph matches in a large deterministic graph. In short, subgraph matching focuses on the structural similarity of graphs, while our problem focuses on the satisfaction of the request and QoS and has no given query graph.

9 CONCLUSIONS

Most current works on the QoS-aware automatic service composition problem are about how to find the (nearly) optimal composition result. This can bring many limitations to both service users and providers. To cope with these limitations, we address top k query for the QoS-aware automatic service composition problem. An efficient KPL algorithm is designed and implemented in this paper. Experiments on public data sets and real Web service QoS values show that the KPL algorithm has better scalability, efficiency and accuracy than related work. Moreover, we present how to extend the KPL algorithm to support multiple QoS measurements. Finally, a new graph problem, the top k DAGs problem, is generalized. Problems from various other applications, such as API recommender and supply chain, can be transformed into this top k DAGs problem, and the techniques developed in this paper can be used to solve them.

Footnote 10 (Section 8.2): Here, let m be the number of edges and n be the number of vertices. The weights are nonnegative.
Footnote 11 (Section 8.2): Simple means loop free. The k shortest simple paths problem has been proved to be more difficult than the k shortest paths problem.

ACKNOWLEDGMENTS

For the completion of this research, we would like to thank the National Natural Science Foundation of China under Grant No. 61070027, 61020106002, 611611605, 61161160566, 60921002. This work is also supported by State Key Laboratory of Software Engineering (SKLSE2012-09-02) and the Science and Technology Project of the State Grid Corporation of China under Grant No. SG [2012] 815.

R EFERENCES [1] [2] [3]

[4] [5] [6] [7]

[1] Wei Jiang, Dongwon Lee, and Songlin Hu. Large-scale longitudinal analysis of SOAP-based and RESTful web services. In ICWS ’12, pages 218–225, 2012.
[2] S. McIlraith and T. Son. Adapting Golog for composition of semantic web services. In KR2002, pages 482–493, April 22–25, 2002.
[3] Evren Sirin, Bijan Parsia, Dan Wu, James Hendler, and Dana Nau. HTN planning for web service composition using SHOP2. Web Semantics: Science, Services and Agents on the World Wide Web, 1(4):377–396, October 2004.
[4] A. Zhou, S. Huang, and X. Wang. BITS: A binary tree based web service composition system. Int. J. Web Service Res., 4(1):40–58, 2007.
[5] Seyyed Vahid Hashemian and Farhad Mavaddat. A graph-based framework for composition of stateless web services. In ECOWS, pages 75–86, 2006.
[6] A. Marconi, M. Pistore, and P. Poccianti. Automated web service composition at work: the Amazon/MPS case study. In ICWS ’07, pages 767–774, July 2007.
[7] Jinghai Rao, D. Dimitrov, P. Hofmann, and N. Sadeh. A mixed initiative approach to semantic web service discovery and composition: SAP’s guided procedures framework. In ICWS ’06, pages 401–410, Sept. 2006.
[8] Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Proc. SOSP, pages 205–220, 2007.
[9] Tao Yu, Yue Zhang, and Kwei-Jay Lin. Efficient algorithms for web services selection with end-to-end QoS constraints. ACM Trans. Web, 1(1):6, 2007.
[10] Liangzhao Zeng, B. Benatallah, A.H.H. Ngu, M. Dumas, J. Kalagnanam, and H. Chang. QoS-aware middleware for web services composition. IEEE Transactions on Software Engineering, 30(5):311–327, May 2004.
[11] Peter Bartalos and Maria Bielikova. Semantic web service composition framework based on parallel processing. In CEC ’09, pages 495–498, 2009.
[12] Yixin Yan, Bin Xu, Zhifeng Gu, and Sen Luo. A QoS-driven approach for semantic service composition. In CEC ’09, pages 523–526, 2009.
[13] Wei Jiang, Charles Zhang, Zhenqiu Huang, Mingwen Chen, Songlin Hu, and Zhiyong Liu. QSynth: A tool for QoS-aware automatic service composition. In ICWS ’10, pages 42–49, 2010.
[14] Wei Jiang, Songlin Hu, and Zhiyong Liu. QoS-aware automatic service composition: A graph view. Journal of Computer Science and Technology, 26(5):837–853, 2011. [online] http://debs.ict.ac.cn/jcst.pdf.
[15] Xiaoling Wang, Sheng Huang, and Aoying Zhou. QoS-aware composite services retrieval. J. Comput. Sci. Technol., 21(4):547–558, 2006.
[16] Michael C. Jaeger, Gregor Rojec-Goldmann, and Gero Mühl. QoS aggregation for web service composition using workflow patterns. In EDOC, pages 149–159, 2004.
[17] Massimo Paolucci, Takahiro Kawamura, Terry R. Payne, and Katia P. Sycara. Semantic matching of web services capabilities. In International Semantic Web Conference, pages 333–347, 2002.
[18] Competition rules of WSC 2009. [online] http://wschallenge.georgetown.edu/wsc09/downloads/wsc2009rules1.1.pdf.
[19] Wei Jiang, Songlin Hu, Dongwon Lee, Shuai Gong, and Zhiyong Liu. Continuous query support in adaptive service composition. In ICWS ’12, pages 50–57, 2012.
[20] Web service challenge 2009. [online] http://wschallenge.georgetown.edu/wsc09/.
[21] Eyhab Al-Masri and Qusay H. Mahmoud. QoS-based discovery and ranking of web services. In ICCCN, pages 529–534, 2007.
[22] China web service cup. [online] http://debs.ict.ac.cn/cwsc2011/result.html.
[23] D. R. Fulkerson L. R. Ford. Multiple attribute decision making: an introduction. Sage Publications, 1995.
[24] Ronald Fagin. Combining fuzzy information from multiple systems (extended abstract). In PODS ’96, pages 216–226, 1996.
[25] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences, 66(4):614–656, 2003. Special Issue on PODS 2001.
[26] David Mandelin, Lin Xu, Rastislav Bodík, and Doug Kimelman. Jungloid mining: Helping to navigate the API jungle. In Proceedings of the 2005 SIGPLAN Conference on Programming Languages Design and Implementation, pages 48–61, 2005.
[27] Suresh Thummalapenta and Tao Xie. PARSEWeb: a programmer assistant for reusing open source code on the web. In ASE ’07, pages 204–213, 2007.
[28] Naiyana Sahavechaphan and Kajal Claypool. XSnippet: mining for sample code. SIGPLAN Not., 41:413–430, October 2006.
[29] Drew V. McDermott. Estimated-regression planning for interactions with web services. In AIPS, pages 204–211, 2002.
[30] Shankar R. Ponnekanti and Armando Fox. SWORD: A developer toolkit for web service composition. In WWW2002, 2002.
[31] Qianhui Althea Liang and Stanley Y. W. Su. AND/OR graph and search algorithm for discovering composite web services. Int. J. Web Service Res., 2(4):48–67, 2005.
[32] Nikola Milanovic and Miroslaw Malek. Search strategies for automatic web service composition. Int. J. Web Service Res., 3(2):1–32, 2006.
[33] Songlin Hu, Vinod Muthusamy, Guoli Li, and Hans-Arno Jacobsen. Distributed automatic service composition in large-scale systems. In DEBS ’08, pages 233–244, 2008.
[34] Gerardo Canfora, Massimiliano Di Penta, Raffaele Esposito, and Maria Luisa Villani. An approach for QoS-aware service composition based on genetic algorithms. In GECCO, pages 1069–1075, 2005.
[35] Xuanzhe Liu, Gang Huang, and Hong Mei. A community-centric approach to automated service composition. Science in China Series F: Information Sciences, 53(1):50–63, 2010.
[36] Jin Liang and Klara Nahrstedt. Service composition for generic service graphs. Multimedia Systems, 11:568–581, 2006.
[37] David Eppstein. Finding the k shortest paths. SIAM J. Computing, 28(2):652–673, 1998.
[38] John Hershberger, Matthew Maxel, and Subhash Suri. Finding the k shortest simple paths: A new algorithm and its implementation. ACM Trans. Algorithms, 3(4), 2007.
[39] Gang Liu and K. G. Ramakrishnan. A*Prune: An algorithm for finding k shortest paths subject to multiple constraints. In INFOCOM, pages 743–749, 2001.
[40] Martin Theobald, Gerhard Weikum, and Ralf Schenkel. Top-k query evaluation with probabilistic guarantees. In VLDB ’04, pages 648–659. VLDB Endowment, 2004.
[41] Scott Fortin. The graph isomorphism problem. Technical report, University of Alberta, 1996.
[42] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: a frequent structure-based approach. In SIGMOD ’04, pages 335–346, 2004.
[43] Huahai He and A.K. Singh. Closure-tree: An index structure for graph queries. In ICDE ’06, page 38, April 2006.
[44] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26(10):1367–1372, October 2004.
[45] Zhao Sun, Hongzhi Wang, Haixun Wang, Bin Shao, and Jianzhong Li. Efficient subgraph matching on billion node graphs. Proc. VLDB Endow., 5(9):788–799, May 2012.
[46] Ye Yuan, Guoren Wang, Haixun Wang, and Lei Chen. Efficient subgraph search over large uncertain graphs. PVLDB, 4(11):876–886, 2011.
[47] Lei Zou, Lei Chen, and Yansheng Lu. Top-k subgraph matching query in a large graph. In Proceedings of the ACM first Ph.D. workshop in CIKM, PIKM ’07, pages 139–146, 2007.

Wei Jiang received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences. He visited Penn State University from 2010 to 2011. His research interests include service computing and distributed event-based systems. He is now with Greatwall Drilling Company R&D Academy of Well Logging, CNPC.

Songlin Hu received his Ph.D. degree from Beijing University of Aeronautics and Astronautics in 2001. He has worked at the Institute of Computing Technology, Chinese Academy of Sciences as an associate professor since 2002, and was a visiting scholar with the Middleware System Research Group at the University of Toronto in 2005. His research interests include distributed event-based systems and service computing.

Zhiyong Liu received his M.S. degree from Northwest Telecommunication Institute and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 1983 and 1987, respectively. He worked as a visiting scholar and a postdoctoral fellow in the U.S.A. and Canada from 1988 to 1992. He is currently a Professor in the Institute of Computing Technology. His research interests include parallel and distributed algorithms and architectures.
