The Role of Data in Creating Next Generation Enterprise Systems

Aad van Moorsel‡, Katinka Wolter†, Philipp Reinecke†
‡ Newcastle University, School of Computing Science, Newcastle upon Tyne, United Kingdom
† Humboldt-Universität zu Berlin, Institut für Informatik, Berlin, Germany
[email protected], {wolter, preineck}@informatik.hu-berlin.de
Abstract

Companies such as IBM, HP, Microsoft, Deloitte, Accenture and SAP all subscribe to a vision of on-demand, virtualised, self-managing enterprise computing solutions. In such visions, resources are shared among applications, flexibly assigned, and managed without human involvement. Technologies such as virtual machines, web services, grids and the semantic web are relied upon to fulfil the vision. To deliver on the cost reduction and profit increase promised by these technologies, advanced mathematical algorithms are required. Researchers in academia as well as industrial labs need system data to test these algorithms in realistic situations. In this paper we discuss the role data plays in research on next generation enterprise computing systems. We illustrate this discussion with our own research on time-out oracles, and propose the creation of a general-purpose public application-level test bed to generate useful data for researchers in industry as well as academia.
1 The Need for Data to Advance Enterprise Computing

Next generation enterprise computing systems are envisaged to be highly adaptive to the needs of users, customers, operators and other players. The business reasoning behind creating such adaptive infrastructures is clear: by adapting resource use to the (changing) needs of the various parties, resources can be utilised much more effectively, and thus become more cost effective. An adaptive infrastructure also enables operators and administrators to do their jobs more effectively. Stronger still, adaptivity hooks in the infrastructure allow for automated system and application management, further reducing the involvement of operators and administrators. The industrial push for an adaptive infrastructure is therefore not surprising, and neither is the amount of research invested in creating it. However, it occurs to us that even industrial research suffers from a lack of accessible data, and therefore from a lack of realistic test environments. Advanced methods and tools are built and evaluated using small-scale, synthetic or simulated data. These methods and tools may not be as useful in practice, because the data used may not be representative of practical applications. Hence, although a methodology may be valid, applying it in practice will demand considerable additional research and engineering effort. We argue, therefore, that the bottleneck in data collection for next generation enterprise computing does not lie in the gap between industry and academia (since industry often does not have the data either), but in the lack of environments in which such data could be generated. The gap lies between research and practice rather than between industry and academia. This creates an interesting challenge for industry, academia and government alike: how can we create test beds that generate realistic data to further develop next generation enterprise systems? In Section 3 we put forth ideas for an application-level test bed inspired by the PlanetLab network-level test bed. First, we illustrate the issues using our own recent research on determining optimal time-out strategies.
2 An Example of Academic Research: Time-out Oracles

Our interest in researching time-out strategies was triggered by the counter-intuitive insight that it may be beneficial to time out and retry a task even if no (discernible) failures occur [1]. This holds if the completion time of a job follows, for instance, a lognormal, heavy-tailed or bimodal distribution. We developed optimal time-out (or retry) strategies for this situation; the results are summarised in [2]. To determine whether existing systems exhibit completion time distributions with the desired characteristics, we created a test bed for downloading web pages and collected data. We found that the time to set up TCP connections is indeed so variable that retries using our strategy improve TCP set-up times. In addition,
[Figure 1: scatter plot of CST2 in ms (y-axis, 0–10000) against CST1 in ms (x-axis, 0–10000).]
Figure 1. Scatter plot for TCP connection set-up time. TCP's time-out values are clearly visible as clusters on the off-diagonals around 3000 ms and 9000 ms on each axis.
the scatter plot in Figure 1 shows that in the cases where retries make sense, the next try is not strongly correlated with the aborted attempt. This implies that the independence assumptions behind our mathematical results hold [3]. In follow-up work, we researched time-out values within the Web Services Reliable Messaging (WSRM) protocol. We evaluated and compared the optimality of algorithms that compute the time-out for message retransmissions, both relative to each other and to a scenario without restart. We employed HTTP and SMTP as SOAP transports. Typical faults, namely packet loss and connection disruptions, were injected at the IP layer and their effects monitored at the WSRM level. We compared various oracles with respect to efficiency and overhead, as discussed in detail in [4].
3 Test Beds for Service-Level Data

The fault-injection test bed we created for our work on time-out oracles is sufficient for our initial research, but beyond that it quickly proves lacking. First of all, the data generated by our test bed and fault-injection set-up cannot be guaranteed to be representative of real-world applications. To further test and improve our algorithms, we eventually need a real-world web services test environment. Furthermore, real-life situations may provide additional failure data that we did not use in our algorithms. In our work we exclusively observe response times, while in practice the times of failures and other system variables may be available. To improve the practical value of our analysis we must close the gap between the data obtained from practical systems and the data required by our system management
algorithms. Another problem we have not touched on is the networked version of the optimisation problem: in our work we assume a single oracle user, while all other time-outs are set in the traditional way. Widespread use of our oracle, however, might invalidate our conclusions. To test this, a sizable test bed is needed. In an ideal world, we would have a 'community' test bed of global proportions in which application- or service-level experiments can be carried out by any interested party. This implies that services implemented by one party can be used by others in their experiments. It also means that base technologies such as WSRM are installed and made accessible to the community. One can even envision a test bed that includes human behaviour, not unlike what is done in experimental economics. Considerable, and possibly insurmountable, challenges stand in the way of establishing such a test bed. It should be obvious that this goes considerably beyond test environments like PlanetLab/GENI, since these concentrate on network technologies. For research in enterprise computing systems we need to move beyond such platforms and create application- or service-level alternatives.
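As a concrete illustration of the kind of component such a test bed would host, the sketch below shows a hypothetical time-out oracle that proposes the empirical quantile of observed response times as the next time-out. The class name, parameters and defaults are invented for illustration and do not correspond to the oracles evaluated in [4].

```python
import bisect

class QuantileTimeoutOracle:
    """Hypothetical oracle: propose the q-th empirical quantile of the
    response times observed so far as the next time-out value (in ms)."""

    def __init__(self, quantile=0.8, initial_timeout=3000.0):
        self.quantile = quantile
        self.initial_timeout = initial_timeout
        self._samples = []  # observed response times, kept sorted

    def observe(self, response_time_ms):
        """Record one observed response time."""
        bisect.insort(self._samples, response_time_ms)

    def timeout(self):
        """Suggest a time-out; fall back to a default before any data arrives."""
        if not self._samples:
            return self.initial_timeout
        idx = min(int(self.quantile * len(self._samples)), len(self._samples) - 1)
        return self._samples[idx]

oracle = QuantileTimeoutOracle(quantile=0.8)
for t in [100, 120, 95, 3000, 110, 105, 130, 98, 102, 115]:
    oracle.observe(t)
print(f"suggested timeout: {oracle.timeout()} ms")  # 130 ms: cuts off the 3000 ms outlier
```

In a shared test bed, many such oracles would observe and react to each other's traffic, which is exactly the networked setting our single-user analysis does not yet cover.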
References

[1] S. M. Maurer and B. A. Huberman, "Restart strategies and Internet congestion," Journal of Economic Dynamics and Control, vol. 25, pp. 641–654, 2001.
[2] A. P. A. van Moorsel and K. Wolter, "Analysis of Restart Mechanisms in Software Systems," IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 547–558, 2006.
[3] P. Reinecke, A. P. A. van Moorsel and K. Wolter, "Experimental Analysis of the Correlation of HTTP GET Invocations," in Formal Methods and Stochastic Models for Performance Evaluation: Third European Performance Engineering Workshop (EPEW), A. Horvath and M. Telek (Eds.), Lecture Notes in Computer Science, vol. 4054, Springer, pp. 226–237, 2006.
[4] P. Reinecke, A. P. A. van Moorsel and K. Wolter, "The Fast and the Fair: A Fault-Injection-Driven Comparison of Restart Oracles for Reliable Web Services," in International Conference on the Quantitative Evaluation of Systems (QEST), IEEE Computer Society, pp. 375–384, 2006.