Distributed Computing Through Web Browser

F. Boldrin, C. Taddia, G. Mazzini
ENDIF, University of Ferrara, Via Saragat 1, 44100 Ferrara, Italy
[email protected], [email protected], [email protected]

Abstract— This paper proposes a new approach to distributed computing. The main novelty consists in the exploitation of web browsers as clients, thanks to the availability of Javascript and AJAX. The described solution has two main advantages: it is client-free, so no additional programs have to be installed to perform the computation, and it requires low CPU usage, so client-side computation is not invasive for users. The solution is developed using AJAX technology, embedding a pseudo-client into the web page that hosts the computation. While users browse the hosting web page, computation takes place, solving single subproblems and sending the solutions to the server-side part of the system. The new architecture has been tested through various performance metrics by implementing two examples of distributed computing: the RSA crack and the correlation index between genetic data sets. Results have shown the good feasibility of this approach.

I. INTRODUCTION

Distributed computing consists in solving a problem that is divided into a number of parts, called subproblems; these subproblems are solved separately, usually by independent computers/processors. The results of the solved subproblems are then reassembled in the correct manner to form the final solution of the original, bigger problem.

There are many different hardware and software architectures that can be used for distributed computing, such as peer-to-peer or client-server. In a peer-to-peer architecture there is no special machine that provides a service or manages the network resources; all responsibilities are uniformly divided among all machines. A client-server architecture is based on a client that, through specific code, contacts the server for data; the server side of the computation usually manages the distribution process, giving a subproblem as the response to each client query. The server can also assemble the subproblems to give the final solution of the original problem.

A great variety of distributed computing projects have grown up in recent years, for example the Folding@home [1] project of the Stanford University Chemistry Department, the SETI@home [2] project of the Space Sciences Laboratory at the University of California, Berkeley, and also LHC@home [3], a project by CERN for simulations of the new Large Hadron Collider. A common aspect of these architectures and of the cited examples is the need for each peer or client to install specific code to solve its subproblem.

The work we present in this paper develops a distributed computing system based on a client-server architecture that uses the extremely large number of machines connected to the Internet as clients, in a non-invasive way.


The Internet has already been used as a network for distributed computing, as in the various @home projects cited before and in many others, more or less publicized. The main innovation of our system is the idea of realizing distributed computing over the Internet without any additional software installation on the client side, using only the direct capabilities of a web browser. We exploit the client-side web browser to execute tasks during the user's navigation. The code describing the operations for the client side is embedded in a web page, thanks to the Javascript and AJAX technologies. Every browser navigating the particular site that hosts the pseudo-client of the system performs a small amount of calculation, using a small percentage of CPU, so users do not lose the usability of their computers.

The main purpose of this work is to test the actual feasibility of this innovative approach and to measure the capabilities and performance of the AJAX architecture in performing distributed computing. To this aim we have defined some performance metrics and we have implemented the distributed computing code for two specific problems: the factorization of a large integer (a basic step in cracking the RSA cypher) and the calculation of the correlation between genetic samples. The first problem substantially aims at showing how to implement the base system, while the second aims at testing the performance of the system in the presence of large data packets and a relatively limited amount of computation.

The rest of the paper is organized as follows: Section II introduces the AJAX functionalities; Section III presents the system architecture by describing the client, server, network and database elements; Section IV gives some details about the realization of the system and the implementation of two problem-solving examples; Section V shows some experimental results through appropriate performance metrics; finally, Section VI ends the paper and suggests future perspectives.

II. WEB BROWSER

The distributed computing system we have developed is based on the employment of a web browser by the user. The classical definition of web browser is wider in this context: the browser is not only the program used to browse web pages but also the tool that performs the computation to find the solution of every subproblem. The browser plays this important role thanks to a Javascript pseudo-client embedded into web pages and to AJAX (Section II-A) directives, which perform the communication tasks. Thanks to this solution the client side does not need any software installation: almost every machine connected to the Internet has a web browser, and almost everyone has an updated one. So the number of potential clients is extremely large compared with a proprietary solution requiring a specific client and a dedicated communication channel.

A. AJAX

AJAX [4] is the acronym of Asynchronous Javascript and XML. AJAX is an integration of consolidated technologies, such as Javascript and XML, together with DOM and BOM practices, used to obtain new functionalities and more control over web applications. In the old standard web model every link produces a flickering in the browser due to the change of page: the view-loss effect. The aim of AJAX is to bring the desktop-program user experience to web applications; instead of the page-to-page linking that produces the view-loss effect, AJAX solutions perform dynamic updating of the content of the page (now considered an application). Access to these features is made through the DOM and BOM APIs, which permit modifying the contents of the browser at runtime without reloading.

AJAX is defined as a rich client. This concept refers to the ability of this "technology" to give users the capability to interact with the web application hosted in the web browser in a very responsive manner, much like a desktop application. A web application can thus respond to user mouse actions with multiple behaviors, differentiating actions for different mouse buttons, offering popup menus and other typical widgets used in desktop applications. Keyboard shortcuts and special keys can also be intercepted to perform other typical actions, such as saving or modifying the application content in a special way, in addition to the standard cut/copy/paste operations usually implemented by a web browser for web-page text.

On the other side, the main difference between an AJAX web application and a common desktop program is the network infrastructure between client and server. The application has data stored on the server and logic stored on the client side, so we have to pay attention to the network time in relation to the various network parameters, such as latency, physical connection line and congestion. These and other considerations are summarized in the literature as the four principles of AJAX [4]:
• The browser hosts the application, not content: usually clients are not aware of the user session, since all this information is stored in the server session. With AJAX we can keep some of the application logic in a browser session, so the client side is aware of the evolution of the current session.
• The server sends data, not content: the whole necessary application code is sent at the beginning, in a single shot. At a later time, since all the application logic is still loaded in the client browser, only data needs to be transferred between client and server side. The page is never fully reloaded, only updated with the server response data or as a consequence of user actions.
• Fluid and continuous interaction: users are no longer limited in their actions to form submission or hyperlink navigation: they can trigger actions with keys, drag&drop, mouse movements and more, adding interactivity to the application without the workflow interruptions due to page reloading.
• Programming discipline: AJAX is true programming, so attention has to be paid to how the code is written. The application has to run without errors or malfunctions during the whole execution time, both internally and with respect to the browser that hosts the application. The client side must be robust and error-free, because it is hosted by browsers running on machines that cannot be controlled.

III. ARCHITECTURE

The basic architecture follows the client-server approach: clients, embedded into web pages, request subproblems to solve, compute the solutions and send the results to the server; this process is iteratively repeated while clients remain active. The server side carries out several tasks: the distribution of the original problem by dividing it into subproblems, the reception of the results, the consolidation of the results to form the total solution, and the management of data losses due to errors and/or network problems.

1) Client: The client is written in the Javascript language. Using AJAX capabilities, such as XMLHttpRequest objects, the client communicates with the server side through timed tasks to obtain data from it. When data is received, the client computes a solution for the received subproblem using appropriate algorithms embedded in the code. The computation algorithms have to be re-engineered to permit control over CPU load and execution speed, to avoid annoying blocks or malfunctioning on the user's machine. This is an important characteristic of the solution, because we want users to have no perception of the computation, which has to be completely transparent to them. Browser Javascript has no multithreading capabilities, so a standard loop would block the user interface until the loop is finished. This behavior is unacceptable if a loop needs several seconds, or minutes, to complete. Classic for and while-do loops are therefore re-engineered as scheduled iterated executions of the loop code, leaving control to the browser user interface between two iterations, so the user maintains his interaction with the browser interface. Details concerning how to re-organize and modify standard programming loops are reported in Section IV.

The client side does not require particular technologies; the target is to reach as many clients as possible, using only functions and functionalities provided by standard browsers. The only required functionality is Javascript and XML support, which is present on most of the modern machines browsing the Internet.

2) Server: The server side manages problem partitioning, subproblem distribution and result reassembling, communicating with clients through the http web protocol. The server logic is implemented in Java servlets called by the AJAX client-side application through http POSTs and GETs.

The servlets manage client queries and make queries on the database to retrieve subproblem parameters and to store results. Every operation is traced into logs, written by the servlets at the end of each completed task. The project is substantially independent of any particular application server, provided that it supports Java Servlets and JSP. The application has been developed on Sun Application Server v9.0 [6], because of its support for the whole J2EE platform and architecture and its integration with Netbeans [7], used as the framework for the whole development process.

3) Network communication: Every client-server connection is made using standard http requests forwarded by the client using XMLHttpRequest, a Javascript object. This object has the capability to perform asynchronous http requests, so the browser does not need to refresh the page: http requests and responses are managed by the underlying Javascript embedded in the displayed web pages. Response results can be inserted into the displayed page at runtime, e.g. using the DOM APIs, avoiding the typical page flickering associated with standard web navigation.

4) Data: Problem data is managed with a database queried by the servlets on the server side, while clients manage the subproblem data received from the server as an XML package, parsed to retrieve the subproblem parameters.

IV. DEVELOPED SOLUTION

In this Section we report in detail how we have implemented the architecture described in Section III. Code IV.1 explains how to overcome the lack of multithreading in Javascript. The problem is solved with a re-engineered cycle that substitutes the standard while and for loops:

Code IV.1 scheduled function

    function f(parameters) {
        ...
        /* function code */
        ...
        if (!ended && !terminated) {
            // last instruction: schedule the next iteration
            setTimeout(f, timeout);  // timeout: int [ms]
        }
    }

The function f implements the operations that a client has to carry out to solve its assigned subproblem. The function f schedules itself for iterated execution until at least one of the two conditions of the if statement becomes false, i.e., when the whole process is finished (ended becomes true) or when the user intentionally terminates the computation (terminated becomes true). By playing with the timeout variable and with the function code it is possible to control the amount of computation performed during each cycle, and thus the CPU load, determined by the length of the function f, which does not release CPU control until the end of its code execution. The timeout controls the time between two subsequent cycles of the function evaluation: the possibility of varying this parameter at runtime permits a dynamic control of the CPU load, performed automatically or manually by the user through the browser interface and a suitable control in the web page.
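As a concrete illustration of this pattern (our sketch, not code from the paper; names such as chunkSize are hypothetical), a classic trial-division loop — the kind used later for the factorization example — can be re-engineered into scheduled chunks as follows:

    // Hypothetical re-engineered loop: tries odd divisors of n in
    // chunks, yielding control to the browser UI between chunks.
    var n = 10403;              // subproblem data: number to factor
    var divisor = 3;            // current odd candidate divisor
    var chunkSize = 1000;       // iterations per chunk: tunes CPU load
    var timeout = 50;           // pause between chunks [ms]
    var ended = false, terminated = false;

    function f() {
        for (var i = 0; i < chunkSize && !ended; i++) {
            if (n % divisor === 0) {
                ended = true;
                postResult(divisor);   // communication function, described below
            } else if (divisor * divisor > n) {
                ended = true;          // no factor in this interval
            }
            divisor += 2;
        }
        if (!ended && !terminated) {
            setTimeout(f, timeout);    // release the UI, then resume
        }
    }
    f();

Here chunkSize and timeout together determine the CPU share taken from the user's browsing session, matching the dynamic load control just described.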

The communication layer of the client application is realized with a couple of other Javascript functions:
• request(): performs a GET http request to the server through the XMLHttpRequest AJAX object, asking for a subproblem to evaluate. The function creates a new XMLHttpRequest object and sends a GET request, querying for a new subproblem. The response contains the XML description of the subproblem; the XML is parsed to retrieve the problem data and then the first call to the evaluation function is made.
• postResult(result): performs a POST http request through the XMLHttpRequest AJAX object, sending the results of the computation to the server. This function creates an XML envelope with the evaluated results and sends it to the server.

Client and server side are completely independent of each other; they communicate only through standard XML packets, so each side can use a specific architecture and/or technology. In this particular case the server side is realized with servlets that offer the entry points for the GET and POST http requests made by clients (see Codes IV.2 and IV.3).

Code IV.2 processGetRequest

    protected void processGetRequest(HttpServletRequest request,
            HttpServletResponse response)
            throws ServletException, IOException {
        ...
        params = getParams();
        if (params != null) {
            // build the XML packet
            response.getOutputStream().write(xmlPacket);
        } else {
            response.sendError(204);
        }
        ...
    }

Code IV.3 processPostRequest

    protected void processPostRequest(HttpServletRequest request,
            HttpServletResponse response)
            throws ServletException, IOException {
        ...
        // parsing of the received XML;
        // database update queries
        ...
    }
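To complement the servlet entry points, a minimal sketch of the client-side request()/postResult() pair described above might look as follows (our illustration: the /solver URL, the tag names and the startEvaluation callback are hypothetical, not taken from the paper):

    // Hypothetical client communication layer (names illustrative).
    function request() {
        var xhr = new XMLHttpRequest();
        xhr.open("GET", "/solver", true);      // asynchronous GET
        xhr.onreadystatechange = function () {
            if (xhr.readyState === 4 && xhr.status === 200) {
                // parse the XML description of the subproblem
                var params = xhr.responseXML.getElementsByTagName("param");
                startEvaluation(params);       // first call to f()
            }
        };
        xhr.send(null);
    }

    function postResult(result) {
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/solver", true);     // asynchronous POST
        xhr.setRequestHeader("Content-Type", "text/xml");
        // wrap the evaluated result in an XML envelope
        xhr.send("<result>" + result + "</result>");
    }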

The general scheme of the XML packet involved in the data transmission is slightly different for GET and POST requests (see Codes IV.4 and IV.5).

Code IV.4 XML packet scheme for GET responses

Code IV.5 XML packet scheme for POST requests
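Only the element values of Codes IV.4 and IV.5 survive in the source; purely as an illustration of the kind of envelope being described (every tag name here is our invention, not the paper's actual schema), such packets might look like:

    <!-- hypothetical GET response: subproblem parameters -->
    <subproblem id="...">
      <param name="...">value</param>
      ...
    </subproblem>

    <!-- hypothetical POST request: evaluated result -->
    <result subproblem="...">value</result>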

Finally, we explain two other aspects, not directly concerning the distributed computing architecture but nonetheless very important: the database and the logging of the operations performed during the problem solving. The database, besides hosting the data of the problem, also has the important task of keeping track of the pending subproblems and of the lost subproblems. By lost subproblems we refer to those for which no response has returned from the client within a certain time; they are considered lost and have to be recovered and re-evaluated by another client. The database hosts the following data:
• original general problem data;
• pending subproblems;
• lost subproblems;
• next problem data: useful to optimize the queries that retrieve the next subproblem to evaluate. Every query also updates this data, so the next query can find a new next subproblem.

As can easily be noted, the overall process is completely independent of the specific problem to solve; the same scheduling, communication process and database organization can be used to solve any problem, changing only the algorithmic part defined in function f, which is specific to every problem.

A. Examples

The problems for which we have implemented the code and tested the architecture behavior are the following:
• RSA cryptosystem crack: in practice it consists in the factorization of a large integer n obtained as the product of two prime numbers p and q. Once n is factored, every other parameter of the RSA algorithm [5] can be found easily, so the crack of the cypher is complete. Our major concern was the implementation of the distributed computing architecture, so the factorization algorithm is not optimized: we simply look for the prime factors of n by trying all possible odd divisors. The set of all the possible odd divisors has been divided into small intervals that constitute the basic data of each subproblem. Each client receives from the server one of these subsets, on which it performs the operations specified in the Javascript code.
• Pearson's correlation evaluation on genetic samples: it consists in computing the correlation in a large database of genetic data. Each record of the database is a set of samples and has to be correlated with each of the others; the purpose is to find the records that present a relevant correlation index.

The algorithm implements Pearson's correlation formula [8]:

    r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}

The factorization problem substantially aims at implementing the base system and showing its feasibility, while the second problem aims at testing the performance of the system in the presence of large data packets and a relatively limited amount of computation. The correlation problem is an example of an application that stresses the system and allows us to look for its limits with respect to the computation and network transfer times. In practice, the system reaches its limit when distributing a subproblem is more expensive than solving it directly, i.e., when the transfer of the various subproblems requires more time than their solution by the clients. In this scenario more time would be spent in data transfer than in computing the solutions of the problem, losing the benefits of the distributed architecture.

V. PERFORMANCE METRICS AND RESULTS

In this Section we present some results concerning the performance of our proposed solution in relation to the amount of data transmitted and the time necessary for the data transmission. The system allows the measurement of the following quantities:
• D_in: GET request packet dimension;
• D_out: POST request packet dimension;
• t_get: GET request total time;
• t_post: POST request total time;
• t_c: subproblem evaluation time.

Thanks to these measurements we can define some metrics to evaluate the performance of the system in different situations and to decide whether the application of the system is convenient. The distributed system is convenient when the time spent in evaluating the subproblems is greater than the time spent in transferring data between server and clients.

The first metric we define is the Packet-to-Transfer time Ratio (PTR), which estimates the speed of the data transfer and is calculated as follows:

    PTR = \frac{D_{in} + D_{out}}{t_{get} + t_{post}}    (1)

With the metric Transfer-to-Evaluation Ratio (TER) we define the ratio between the time necessary to transfer the data of a subproblem and the time needed to evaluate the subproblem solution; it is calculated as:

    TER = \frac{t_{get} + t_{post}}{t_c}    (2)

Finally, the metric Data-to-Evaluation Ratio (DER) gives an idea of the algorithm used to evaluate the problem solution: high values mean heavy use of data relative to computation, while low values mean that little data sustains much computation.

    DER = \frac{D_{in} + D_{out}}{t_c}    (3)
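The paper does not detail how the quantities D_in, D_out, t_get, t_post and t_c are collected on the client; one plausible sketch (hypothetical, using timestamps around the XMLHttpRequest call of Section IV) is:

    // Hypothetical measurement of t_get and D_in for one subproblem;
    // t_post is measured the same way around postResult(), and t_c can
    // be accumulated inside each scheduled chunk of f().
    var tStart = new Date().getTime();
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/solver", true);          // illustrative URL
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            var tGet = new Date().getTime() - tStart;  // t_get [ms]
            var dIn = xhr.responseText.length;         // D_in [bytes, approx.]
        }
    };
    xhr.send(null);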

TABLE I
MEASURES: FACTORIZATION AND CORRELATION RESULTS

    Measure | Factorization | Correlation
    --------+---------------+--------------
    PTR     | 0.694 B/ms    | 314.982 B/ms
    TER     | 0.133         | 0.041
    DER     | 0.092         | 12.898

These three metrics are linked to each other by the following relation (the transfer time t_get + t_post cancels when multiplying PTR by TER):

    DER = PTR \cdot TER    (4)
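As a quick arithmetic check of Eq. (4) against Table I (our verification, not reported in the original): 0.694 · 0.133 ≈ 0.092 for the factorization problem and 314.982 · 0.041 ≈ 12.9 for the correlation problem, matching the tabulated DER values up to rounding.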

In particular, we have applied these metrics to test the implementation of the two problems described in the previous Section. The results are reported in Table I and have been obtained by running clients inside the LAN of the University of Ferrara.

The PTR measured for the first problem, the RSA crack, shows a network speed quite low for a LAN environment: this is because the very small data packets transferred do not use the network connection efficiently, while the second problem, the Pearson's correlation, transfers data at a high speed on the network.

The index TER is probably the most important one, since it indicates the ratio between the time spent in transferring the data and the time spent in their evaluation. The convenience threshold for this index is 1: above this value the implementation is not convenient, because more time is spent in transferring data than in evaluating results. Below unity, and as close to 0 as possible, the system becomes powerful, and the experiments show that for the solved problems the implementation is quite good. In particular, a little optimization has been done to obtain an optimal value of TER for the Pearson's correlation problem, where the data transfer is critical with respect to the evaluation.

Finally, DER shows the usage of the problem data received by the client: the best values are near 0, and the RSA crack is very close to this value because of the extremely long calculation on the few numbers that characterize each subproblem. For the correlation problem the index is slightly larger, but many parameters are involved in this measure, such as the number of loops performed at each step and the number of steps to be performed. So this index gives a general idea of the performance: it is not very significant when comparing different problems, but it can be compared with the other indexes of one single problem to retrieve information about optimization and data usage.

The performance measured during the experiments is good for the realized system. Nevertheless, it has to be pointed out that on the real Internet the controls of the execution speed and of the subproblem packaging have to be adapted according to the differences between a LAN and a WAN, in terms of connection speed and latency.

VI. CONCLUSIONS

The system presented in this paper is a first study on the realization of a new distributed computing solution using the web as infrastructure. The final results of this work have shown that such a system can be realized with good performance and a simple architecture, using well-known technologies and software, standard protocols and specifications, to obtain a worldwide, installation-free distributed computing client, which is the main goal of the project.

Some aspects still have to be analyzed. For example, an important issue concerns the security related to the code executed by the client and to the results received by the server: on the client side we must ensure that the executed code does not cause malfunctions, data loss or other problems (thinking about the system in terms of a service offered by a third party); on the server side we have to pay attention to the data returned by the computation, because the client can potentially send malicious or wrong results to the server.

Another goal is to re-engineer the basic system to generalize it, by extracting the algorithm-dependent part from the low-level architecture whose task is to perform the distribution of the subproblems and the reassembling of the results. Now the base system has the problem logic embedded into the program, so every particular problem solved has a specifically programmed part that solves it. The aim is to move the problem logic outside the architecture, to obtain a more general environment that accepts the problem-solving algorithm as a part of the problem, just like any other problem data. In this view the algorithms are at the same level as the problem data, being passed as a parameter of the problem to solve. Future activities will be devoted to solving these issues, to obtain a robust system, general in terms of architecture and powerful in performance.

REFERENCES

[1] Folding@Home, http://folding.stanford.edu/
[2] SETI@Home, http://setiathome.berkeley.edu/
[3] LHC@Home, http://lhcathome.cern.ch/
[4] Dave Crane, Eric Pascarello, and Darren James, Ajax in Action, Manning, 2006.
[5] Wikipedia, RSA, http://en.wikipedia.org/wiki/RSA
[6] Sun Microsystems, http://www.sun.com/software/products/appsrvr_pe/index.xml
[7] Netbeans.org, http://www.netbeans.org
[8] HyperStat Online, Pearson's correlation, http://davidmlane.com/hyperstat/A34739.html
