[Pre-print of conference paper in the proceedings of IEEE-ISSPIT 2014 ©IEEE]

Change Propagation based Incremental Data Handling in a Web Service Discovery Framework

Sowmya Kamath S.
Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangalore - 575025, India
Email: [email protected]

Ananthanarayana V.S.
Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangalore - 575025, India
Email: [email protected]

Abstract—Due to the explosive growth in the availability of Web services over the open Web and the heterogeneous sources in which they are published, discovering relevant web services for a given task continues to be challenging. To deal with these problems, a bottom-up approach was proposed that finds published service descriptions and automatically builds a service repository, forming the basis of a web service discovery framework. This framework employs a specialized mechanism to exclusively find published service descriptions on the Web. Since there are periodic, repeated runs to find new service descriptions or to check the continued availability of already crawled services, the framework is inherently dynamic in nature. Hence, it is critical to keep track of entities such as visited URLs and already added and successfully processed service descriptions, to avoid rework when the service data set changes. To cope with this problem, we propose a change propagation technique using an event-based state machine to incorporate an incremental processing strategy into this Web-scale framework.

Keywords: web service discovery, service crawler, change management, incremental processing

I. Introduction

Web services have become very popular in recent years, both in enterprise-level applications and in lightweight web application development, due to their promise of interoperable machine-to-machine interaction and ongoing standardization efforts. Composability is another feature that facilitates building newer, larger applications from basic services as building blocks, seamlessly providing more value-added services [1]. This has resulted in complex web-service-based systems such as those in the e-commerce and travel domains. One of the main requirements for achieving automatic integration is service discoverability: service designers need to find suitable services to integrate when creating new applications. Such services were earlier available through the Universal Business Registry service [2]; however, this was shut down in 2006. Since many service providers now choose to host their services and their descriptions on their own servers or on service portals, the number of services freely available on the Web has increased drastically [3].

Service descriptions are most often available embedded within webpages or can be accessed through their endpoint URIs. This is similar to the way web pages are retrieved; thus, one could potentially employ techniques adopted from information retrieval (IR) for web service discovery. Several researchers have used both IR-based techniques [4][5] and data mining based techniques [6] for web service discovery and retrieval. The algorithms that search engines use to provide better and more accurate results to users are already quite sophisticated, as they nowadays combine several cutting-edge techniques such as usage analytics and natural language processing (NLP). These techniques can also be incorporated into a system that facilitates efficient web service discovery. The focus of the proposed framework is to first find published service descriptions on the Web to build a scalable service repository, then analyze the service descriptions and automatically generate tags for them based on their natural language descriptions and other features. The tagged services are then clustered to further improve the efficiency of the discovery process by reducing the search space. The framework finally provides a query interface through which users can find relevant services. In this paper, we present a change management strategy to effectively handle the inherently dynamic nature of this framework. The proposed change propagation technique is modeled as an event-based state machine to incorporate an incremental processing strategy into this Web-scale framework.

The rest of the paper is organized as follows. Section 2 discusses existing work in the area of web service discovery, specifically the problem of finding services on the open Web. Section 3 presents the proposed system and its components. In Section 4, we discuss the details of the proposed change management technique. Section 5 discusses some observations about the proposed strategy, followed by the conclusion and references.

II. Related Work

The problem of finding published services on the Web is currently an area of active research interest. The challenge is that services are distributed and are available from a variety of sources, ranging from service providers' servers to service portals. To overcome this, Wu and Chang [4] proposed a data mining based system that focuses on retrieving web services openly available on the Web.


Song et al. [5] used conventional web search engines to discover Web services. They evaluated nine different approaches for publishing web services and used 18 different queries for retrieving these services through Yahoo and Google. They noted that embedding a WSDL specification within a web page that provides a detailed description of the service yielded the best results. Al-Masri and Mahmoud [7] proposed the concept of a "service crawler" with a scalable, updatable centralized repository that functions as a UDDI. They also reported that, after the public UDDI registry was discontinued, more than 53% of the services available in various service portals were inactive, in contrast to 92% of the services cached by search engines being active [2]. Effective techniques for incremental processing of changes have been proposed in the area of large databases with high update rates, which result in dynamic data [8][9]. Change propagation strategies have also been presented for effective software system design and product architecture [10][11]. Dam et al. [11] proposed an automatic change propagation approach based on an agent architecture called Belief-Desire-Intention (BDI) [12], which is modeled on how humans reason about resources in practical scenarios; this framework most importantly provided traceability functions in a complex software system. Similarly, several researchers have stressed the need for, and proposed, change propagation techniques for large-scale information systems that handle huge numbers of updates, deletions and insertions [13][14][15]. Bu et al. [16] proposed HaLoop, a parallel and distributed system built as a modified version of Hadoop, which supports large-scale iterative data analysis applications based on an incremental change propagation strategy [17].

Our aim is to propose an effective change management strategy to handle the inherently dynamic nature of the proposed framework. We model the change handling mechanism as an event-based state machine to handle the events associated with the various data entities in our Web-scale framework.

III. Proposed System

A. Finding service descriptions on the Web

Currently, potential sources of published services include certain public web service portals and service providers' websites. Earlier, the UDDI Business Registry (UBR) was also available; however, it has been permanently discontinued since 2006. There are several portals where service providers can add their service details, but these are non-UDDI repositories. Some of them are quite well maintained and regularly updated, while others contain out-of-date service data. Fan and Kambhampati [18] reported that, even with these issues, such online registries contain quite a large number of service descriptions, which can be searched and used by service consumers. Therefore, they are a good starting point.

When conventional web crawling techniques used for web pages are to be employed for finding WSDLs, we need to consider several important factors that distinguish the two.

Traditional web crawlers meant for web pages are not well suited to finding web service descriptions, since web pages contain a lot of textual information whereas a web service often has only a short textual description. Therefore, popular IR methods such as Term Frequency/Inverse Document Frequency (tf-idf) are not very suitable. Web pages are mostly written in HTML with a predefined set of tags, whereas service descriptions are written in XML with user-defined tags, which requires knowledge of XML schemas and namespaces. Finally, a web service description does not follow the link structure that web pages have; there are no links connecting one WSDL file to another. WSDL is meant to describe the capabilities of a single Web service, not to connect different Web services. Hence, it is not possible to use a newly discovered service as a feed to find more services, as conventional web crawlers do with web pages. Due to these problems, directly using web page crawling techniques for finding WSDLs is quite difficult, and a more sophisticated approach is required. Hence, a hybrid crawler mechanism is needed to specifically handle web services and to collect service descriptions both from non-UDDI service portals and from service providers' websites. In order to achieve this, we perform the following steps:

• A specialized crawl of the Web to find and retrieve only service descriptions.
• Semantic analysis of the service descriptions to enable domain specific filtering.
• Collection of related information for each service as additional metadata, if available.
• Finally, since crawler runs are periodic, ensuring that no duplicates are introduced.

B. Building a Service Repository

In order to build a repository of published web service descriptions, we first identified several non-UDDI service repositories such as ProgrammableWeb, BioCatalogue, WebServiceList and XMethods, where service descriptions are available. Most of these provide only a short description and a link to the service provider's website, while the first two provide access to their data through a standard, open API. Hence, we handle the repositories providing open APIs separately by simply calling these APIs to collect data, while a service crawler based approach is used for finding service descriptions available on web servers and non-API service portals. Algorithm 1 shows the working of the service crawler.

In order to restrict the service crawler to Web service descriptions only, certain additional checking rules are incorporated into the crawler algorithm. To identify whether a retrieved web document is a web service description, we apply three rules: first, verifying whether the URL of the retrieved webpage matches the WSDL regular expression (the URL ends in ".wsdl" or "?WSDL"); second, checking the HTTP response header for web service MIME types ("application/wsdl+xml" or "text/xml"); and third, checking the source code of the webpage to determine whether any web service description indicators are present. It was seen that at least two of these three tests have to hold true to make a confident decision. If only the Content-Type response header is considered, there is an increased possibility of false positives, especially if the MIME type is "text/xml".


Algorithm 1 Service Crawler

while crawler queue Q with feed URLs ≠ empty do
    perform webpage.getURL() for each feed URL
    extract all hyperlinks from the webpage   ▶ list of new feed URLs
    for each hyperlink extracted from the webpage do
        if hyperlink is not on the visited-URLs list then
            apply checking rules (1) to (3)
            if at least two of the three are positive then   ▶ hyperlink contains WSDL
                check validity of the WSDL structure
                if WSDL is valid then
                    download file
                    set file status to "NewWSDL"
                end if
                if WSDL is invalid then
                    add webpage.url to invalid-list   ▶ to avoid reprocessing
                end if
            end if
        end if
    end for
end while
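As an illustration of checking rules (1) to (3) and the two-of-three decision used in Algorithm 1, a minimal Python sketch follows (the indicator strings and function name are illustrative assumptions, and Python's requests library is assumed for fetching; the paper does not prescribe an implementation):

```python
import re
import requests

WSDL_URL_PATTERN = re.compile(r"(\.wsdl$|\?wsdl$)", re.IGNORECASE)
WSDL_MIME_TYPES = {"application/wsdl+xml", "text/xml"}
# Illustrative indicators only; the exact indicator list is not specified in the paper.
WSDL_INDICATORS = ("<wsdl:definitions", "<definitions", "xmlns:wsdl")

def looks_like_wsdl(url: str) -> bool:
    """Apply rules (1)-(3); at least two of the three must hold."""
    response = requests.get(url, timeout=10)
    rule1 = bool(WSDL_URL_PATTERN.search(url))                        # (1) URL pattern
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    rule2 = content_type in WSDL_MIME_TYPES                           # (2) MIME type
    rule3 = any(ind in response.text for ind in WSDL_INDICATORS)      # (3) page source
    return (rule1 + rule2 + rule3) >= 2
```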

If at least two of the tests are true, a valid service description may have been found. The crawler downloads this document and applies a WSDL structure check; if found valid, the service description is added to the system's service repository. More specifically, the WSDL downloader downloads the file from its location on the Web and stores it in the WSDL Pre-processor database. After this, the WSDL Parser parses the entire WSDL file to first check whether it has a valid structure, i.e. the standard DOM tree. If it is found to be valid, the algorithm generates a hash of the entire file content, with which the WSDL file is indexed in the WSDL Repository. If the WSDL is invalid, the URL is added to the invalid-WSDL-URL list, which is used by the crawler to avoid processing it unnecessarily during future crawling rounds. The algorithm then extracts service details such as the service URL, operation names, input/output names, message parts and the service description. One additional step is to try and collect some metadata about each service. Some Web service registries provide additional information about services, such as availability, version, the service provider's details, QoS parameters and feedback from other users. The Metadata Collector component collects any such additional information about the service; however, if no additional information is available, there will be no metadata file for that service. The process then continues with the next URL on the web page given as the input URL. Each time a new URL with a service is found, the WSDL file is downloaded and parsed and its hash is generated. The algorithm then checks the indexed service hashes to determine whether the WSDL file is a duplicate before adding it to the WSDL Repository. Figure 1 presents the workflow of the service crawling and WSDL collection process.
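A hedged sketch of the DOM validity check and hash-based duplicate test described above (the hash function and the in-memory repository layout are assumptions; the paper does not specify them):

```python
import hashlib
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

def is_valid_wsdl(content: str) -> bool:
    """Check that the downloaded file parses into a DOM tree rooted at a WSDL 'definitions' element."""
    try:
        dom = parseString(content)
    except ExpatError:
        return False
    return dom.documentElement.tagName.split(":")[-1] == "definitions"

def add_if_new(content: str, repository: dict) -> bool:
    """Index the WSDL by a hash of its full content; skip duplicates already in the repository."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()  # SHA-256 chosen for illustration
    if digest in repository:        # same description already crawled from another source
        return False
    repository[digest] = content    # store (or persist) the file keyed by its hash
    return True
```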

IV. Incremental Strategy for Dynamic Data

In the proposed system, we have to handle several thousand data entities, such as the feed URLs that are crawled and the WSDLs added during successive runs of the service crawler. The WSDL repository is therefore not built in a single crawl but requires multiple crawls to contain a sufficiently large collection of service descriptions. Hence, any further processing of the WSDL files before they can be made available for domain specific discovery needs to be handled intelligently, to avoid rerunning the conversion algorithm on the whole data set each time new content is crawled.

TABLE I: States in the Proposed Incremental Strategy

No. | State          | Meaning                                                                                                  | Associated Action
1   | URLNoMatch     | URL of the crawled webpage does not match the WSDL regular expression.                                   | Update the feed-URL list & sublink list to exclude the identified URL.
2   | noServiceURL   | URL that matched the regular expression does not contain a valid service URL for further action.         | Update the feed-URL list & sublink list to exclude the identified URL.
3   | invalidWSDL    | Downloaded WSDL did not pass the DOM tree verification test.                                             | Update the feed-URL list to exclude the invalid WSDL's webpage URL.
4   | duplicateWSDL  | Hash of the service data already exists in the WSDL Pre-processor database.                              | Update the feed-URL list to exclude the identified service URL as already existing.
5   | NewWSDL        | Validated WSDL file is stored in the WSDL Repository, indexed by its hash.                               | Initiate extraction of service details & collection of metadata for the service.
6   | ModifiedWSDL   | A new version of an existing WSDL file was found (status given only after all other checks have passed). | Delete the older version of the WSDL file & add the new file to the WSDL Repository with status "NewWSDL".
7   | WSDLProcessed  | All possible information has been extracted from the WSDL & it is ready for use.                         | Add the WSDL to the service repository with status "ToBeClustered".
8   | ToBeClustered  | Status of a WSDL that was successfully added to the repository.                                          | To be passed on to the clustering algorithm.
9   | Active         | Status of WSDL files that were successfully clustered.                                                   | Available for user querying.
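For illustration, the state set of Table I could be captured as a simple enumeration (a minimal Python sketch; the actual data structures used in the framework are not specified in the paper):

```python
from enum import Enum, auto

class EntityState(Enum):
    """States a data entity (URL or WSDL file) can take in the incremental strategy (Table I)."""
    URL_NO_MATCH    = auto()  # URL does not match the WSDL regular expression
    NO_SERVICE_URL  = auto()  # matched URL holds no valid service URL
    INVALID_WSDL    = auto()  # downloaded WSDL failed the DOM tree check
    DUPLICATE_WSDL  = auto()  # hash already present in the WSDL Pre-processor database
    NEW_WSDL        = auto()  # validated WSDL stored, indexed by its hash
    MODIFIED_WSDL   = auto()  # a new version of an existing WSDL was found
    WSDL_PROCESSED  = auto()  # all details extracted, ready for use
    TO_BE_CLUSTERED = auto()  # added to the repository, awaiting clustering
    ACTIVE          = auto()  # clustered and available for user querying

# In the first four states the tracked entity is a URL; from NEW_WSDL onwards it is a WSDL file.
URL_STATES = {EntityState.URL_NO_MATCH, EntityState.NO_SERVICE_URL,
              EntityState.INVALID_WSDL, EntityState.DUPLICATE_WSDL}
```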


Fig. 1: Workflow of the Service Crawling Process

Therefore, an incremental strategy is needed to process the dynamic data in a cost- and time-effective manner. The problem here is that we have a large volume of data in the form of service descriptions crawled from the Web, and over time this volume of data is subject to many (relatively) small changes each time new service descriptions are crawled and added. Since we are processing this data in other ways as well (for example, keeping track of already visited URLs, adding newly verified WSDL files to the repository, clustering them to achieve search space reduction and finally making them available for discovery), we want to avoid recomputing over the entire volume of data after every small change. This is a significant challenge, and to manage it efficiently we propose the use of an incremental change management technique. The basic idea of an incremental algorithm [19] is that we only process the new changes and propagate the results down the line to the next components, without having to redo the computations for the entire data set [20].
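To make the idea concrete, a hedged sketch of such an incremental pass follows (the function name and the repository representation are illustrative assumptions, not the paper's implementation):

```python
def incremental_pass(repository, process, target_status, next_status):
    """Run `process` only on entries whose status equals `target_status`,
    then advance their status; the rest of the repository is left untouched."""
    changed = [entry for entry in repository if entry["status"] == target_status]
    for entry in changed:              # only the delta, not the whole repository
        process(entry)
        entry["status"] = next_status
    return changed

# Example: select only WSDLs with status "WSDLProcessed" and pass them to the clustering module.
# incremental_pass(wsdl_repository, cluster, "WSDLProcessed", "ToBeClustered")
```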


The proposed incremental strategy is based on the concept of a state machine [21]. Each entity is assigned a "status" that keeps track of the state it is in. Table I lists these states in the order corresponding to the various phases defined in the proposed framework. The main focus here is to ensure that changes are propagated between components so that consistency is maintained, and that changes are processed automatically without any need for manual intervention. As shown in Table I, a data entity in the system can be in various states ranging from "URLNoMatch" to "Active", depending on the processes and transformations it has to undergo. The state a data entity is in also identifies the kind of entity in question. For example, in states "URLNoMatch" through "duplicateWSDL", the concerned data entity is a URL; keeping track of the state of a URL is useful for optimizing the functioning of the service crawler during subsequent crawls. In states "NewWSDL" through "Active", the WSDL file is the data entity under consideration. When new WSDL files are added to the WSDL Repository after successful metadata collection, their status is "WSDLProcessed", and the algorithm selects only these files to pass to the next module, the clustering algorithm. Hence, we effectively process only new additions rather than all the data in the repository, achieving an incremental processing scheme for the proposed framework based on automatic change propagation.

A. Event-Driven Change Propagation

Even in an incremental approach, it is inefficient to continuously scan the database for state changes after each small change and then apply the appropriate actions to update the state of the associated data entity. Hence, we propose the use of event-driven programming to decide when each of the modules should do its job. The proposed system's state machine is an event-driven model that keeps track of the various events. When the number of events exceeds a fixed threshold, the corresponding actions are initiated and the change propagation happens automatically. Algorithm 2 depicts the sequence of actions to be followed in the case of a "URLNoMatch" status: the event is simply logged on the event counter and the URL is added to the URLNoMatch list. After this, the system continues processing the next URL in the crawler list, continuously keeping track of subsequent events in an appropriate list. When the predefined threshold is reached for URL-related events, the system automatically updates the crawler's feed-URL list by deleting the URLs on the URLNoMatch list from it. The same procedure is followed for events such as "noServiceURL", "invalidWSDL" and "duplicateWSDL".

Algorithm 2 Event-driven Automatic Change Propagation Algorithm

initialize event-count := 0
for each event-type do
    while event-count ≤ threshold do
        log each data entity's state change   ▶ keep track of all URL/WSDL events
        increment the count of that type of event
        if event-count > threshold then
            perform the action associated with the event-type
            reset event-count
        end if
    end while
end for
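A minimal Python sketch of the event counting and threshold-triggered propagation in Algorithm 2 (the threshold value, action callbacks and class name are illustrative assumptions):

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ChangePropagator:
    """Log state-change events per event type and fire the associated batch action
    only once that type's pending count exceeds the threshold."""

    def __init__(self, actions: Dict[str, Callable[[List[str]], None]], threshold: int = 100):
        self.actions = actions                    # e.g. {"URLNoMatch": prune_feed_url_list}
        self.threshold = threshold
        self.pending: Dict[str, List[str]] = defaultdict(list)

    def log_event(self, event_type: str, entity: str) -> None:
        self.pending[event_type].append(entity)   # log the state change for this entity
        if len(self.pending[event_type]) > self.threshold:
            self.actions[event_type](self.pending[event_type])  # e.g. update the feed-URL list
            self.pending[event_type].clear()      # reset the event count for this type
```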

V. Analysis and Discussion

To collect information about available services, the service crawler is designed to fetch the registered information of each service. Services listed on portals usually include information such as the service name, provider details and feedback from other users. This information was retrieved, and the accessible WSDL files were added by following the links provided, after the relevant checks. Sometimes the given links do not point to the actual service descriptions, but to some webpage on the service provider's site that serves as an introductory page to the service's capabilities and documentation. In most cases, this problem is easily solved by appending ?WSDL to the end of the given URL. If the URL still does not yield the WSDL file, we apply the BFS crawling strategy by adding this link to the feed-URLs list and try to find WSDL files associated with the services provided by that particular service provider. This particular feed URL will be tested during the next scheduled crawler run; in case it still does not yield any WSDLs, it is added to the invalid-URL list to avoid reprocessing the same link during subsequent crawler runs.

Web pages that match the required pattern may sometimes yield WSDL files that are not valid, since they may not conform to the standard WSDL document structure. We consider such entries invalid. Each discovered WSDL file is checked to see whether it has a valid WSDL DOM tree structure and only then added to the WSDL Repository. Due to the repeated crawler runs, there is a possibility of duplicates being added, since the same service description may be available on the service provider's site as well as on a service portal. To ensure that the repository holds only unique WSDLs, we generate a hash of the WSDL contents and use it as an index while saving the processed file in the WSDL Repository. Before adding a newly found WSDL to the repository, its hash is checked against the hash index to ensure that it is not a duplicate of an existing WSDL.

The crawling process starts with a list of initial input URLs called feed URLs and crawls to a predefined depth d. At each level up to depth d, the crawler explores all new neighboring links and then does the same for each newly discovered link, and so on until depth d is reached. Hence, considering that the branching factor of each explored link is b and the depth is set to d, the time taken by the algorithm is exponential and is given by O(b^d). At each level, each explored link must be saved for later revisiting, so that its child nodes can be generated.


Due to this, the space complexity of the algorithm depends on the number of nodes at the deepest level; hence the asymptotic space complexity is also O(b^d). It is therefore crucial to keep track of irrelevant links, i.e. those that were invalid or did not match the WSDL regular expression, to avoid processing them during later crawler runs. This is possible because of the proposed change management technique.

Table II presents some information on the crawling performance of the multi-threaded Service Crawler: the number of URLs collected by the crawler over time, the number of URLs that matched the WSDL URL pattern, and the number of valid WSDL files obtained. It can be seen that the number of irrelevant URLs that have to be processed in order to obtain a few URLs matching the WSDL URL pattern, which may or may not lead to valid WSDL files, is quite high. This stresses the importance of optimizing the data handling process to avoid unproductive, additional processing of data entities such as irrelevant URLs and WSDL files after each crawler run.
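As a quick check of the stated bounds, the number of links explored by a breadth-first crawl with branching factor b up to depth d can be written as

```latex
\sum_{i=0}^{d} b^{i} \;=\; \frac{b^{d+1}-1}{b-1} \;=\; O(b^{d})
```

and since the frontier at the deepest level alone holds on the order of b^d links that must be stored for later expansion, the space bound matches the time bound.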

TABLE II: Crawl data specifics of the Multi-threaded Service Crawler

Number of URLs fetched | Time taken to collect (in minutes) | URLs matching WSDL URL pattern | Links with valid WSDL files
1000  | 6  | 87   | 39
5000  | 21 | 271  | 134
10000 | 39 | 507  | 298
15000 | 58 | 722  | 412
20000 | 82 | 1037 | 609

VI. Conclusion & Future Work

In this work, we proposed an effective change management strategy for a Web-scale framework that focuses on collecting published service descriptions from the Web. We have currently collected more than 17,000 service descriptions from service portals and the open Web. New data is added and existing data may be modified during periodic crawler runs, so handling the various data entities can become cumbersome due to the dynamic nature of the framework. Hence, an event-driven, state machine based change propagation algorithm has been incorporated into the system, which resulted in very efficient handling of the dynamic dataset. As part of future work, we intend to analyze this algorithm and optimize it further. The web service discovery framework based on this incremental data handling approach is also under development.

References

[1] M. Klusch, "Service discovery," in Encyclopedia of Social Network Analysis and Mining (ESNAM), Springer, 2014.
[2] E. Al-Masri and Q. Mahmoud, "Investigating web services on the world wide web," in 17th Intl Conf on World Wide Web, ACM, 2008.
[3] D. Bachlechner et al., "Web service discovery - a reality check," in European Semantic Web Conference, vol. 308, 2010.
[4] C. Wu and E. Chang, "'Searching services on the web' - a public web services discovery approach," in Third Conf. on Signal-Image Technologies and Internet-based Systems, IEEE, 2008.
[5] H. Song et al., "Web service discovery using general-purpose search engines," in Web Services, IEEE Intl Conf on, pp. 265–271, IEEE, 2007.
[6] J. Wu et al., "Clustering web services to facilitate service discovery," Knowledge and Information Systems, vol. 38, no. 1, 2014.
[7] E. Al-Masri and Q. Mahmoud, "A broker for universal access to web services," in 7th Annual Communication Networks and Services Research Conf., IEEE, 2009.
[8] F. Masseglia et al., "Incremental mining of sequential patterns in large databases," Data & Knowledge Engineering, vol. 46, no. 1, pp. 97–121, 2003.
[9] T. Griffin and B. Kumar, "Algebraic change propagation for semijoin and outerjoin queries," SIGMOD Record, vol. 27, no. 3, pp. 22–27, 1998.
[10] T. Jarratt et al., "Product architecture and the propagation of engineering change," in DESIGN 2002, the 7th International Design Conference, Dubrovnik, 2002.
[11] K. H. Dam et al., "An agent-oriented approach to change propagation in software evolution," in Australian Software Engineering Conference, IEEE, 2006.
[12] M. Bratman, Intention, Plans, and Practical Reason, 1987.
[13] C. Constantinescu et al., "A system for data change propagation in heterogeneous information systems," in ICEIS, pp. 73–80, 2002.
[14] R. Rantzau et al., "Champagne: data change propagation for heterogeneous information systems," in 28th Intl Conf on Very Large Data Bases, pp. 1099–1102, VLDB Endowment, 2002.
[15] P. Bhatotia et al., "Large-scale incremental data processing with change propagation," in 3rd USENIX Conference on Hot Topics in Cloud Computing, USENIX Association, 2011.
[16] Y. Bu et al., "HaLoop: Efficient iterative data processing on large clusters," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 285–296, 2010.
[17] Y. Bu et al., "The HaLoop approach to large-scale iterative data analysis," The VLDB Journal, vol. 21, no. 2, 2012.
[18] J. Fan and S. Kambhampati, "A snapshot of public web services," ACM SIGMOD Record, vol. 34, no. 1, pp. 24–32, 2005.
[19] F. Can, "Incremental clustering for dynamic information processing," ACM Transactions on Information Systems (TOIS), vol. 11, no. 2, pp. 143–164, 1993.
[20] C. G. Lopes and A. H. Sayed, "Incremental adaptive strategies over distributed networks," IEEE Transactions on Signal Processing, vol. 55, no. 8, pp. 4064–4077, 2007.
[21] J. E. Hopcroft, Introduction to Automata Theory, Languages, and Computation. Pearson Education, 1979.

