System Administrators are Users, Too: Designing Workspaces for Managing Internet-Scale Systems
Rob Barrett IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 USA +1 214 252 0818
[email protected]
Yen-Yang Michael Chen 445 Soda Hall Computer Science Division Univ. of California at Berkeley Berkeley, CA 94720 USA +1 510 643 9435
[email protected]
Paul Maglio IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 USA +1 408 927 2857
[email protected]
THE TOPIC
Administrators as Users
The focus of most human-computer interaction work has been on the end users of computing systems, those using computers to accomplish their work. However, another important class of computer users is the cohort of administrators who design, build, maintain and troubleshoot computer systems as their main work. As computers have become more ubiquitous, and particularly as networked services have grown through the use of email, the web, and instant messaging, the importance and complexity of computational infrastructure have increased. End users have become dependent upon the availability of services such as network-attached storage, web servers, and email gateways. These Internet-scale services often have thousands of hardware and software components [3,6] and require considerable human effort to plan, configure, install, upgrade, monitor, troubleshoot, and sunset. The complexity of managing these services is alarming: a recent survey of three Internet sites showed that 51% of all failures were caused by operator errors [7].
The goal of this workshop is to bring together researchers, product designers, and system administrators to increase interest within the CHI community in improving the environment in which system administrators work.
Costs, Risks and Benefits
The importance of improving the system administration environment rests on three related ideas: cost, risk and benefit.
First, there is a cost associated with the difficulty of administering complex computational systems. As the cost of computational technology has decreased, the human cost of operating computers has become more noticeable and significant. For example, ten years ago a data storage facility spent approximately two thirds of its money on technology for storing information and one third on the human operators of the system. Now the fractions have reversed, so the relative cost of human operators has doubled [1,2,5]. This increasing relative cost of administration suggests that significant savings could be gained by improving the human factors of the administration environment.
Second, as technical complexity increases there is an increasing risk of costly breakdowns. Furthermore, as individuals and society become more dependent on information systems for critical tasks, these risks are not just monetary. An outage at a world-scale auction web site may cost $225,000 per hour [8], but even more importantly, loss of communications and computer services can mean the difference between life and death in emergency situations. The growing threats of viruses and e-terrorism compound the urgency of the situation. Just as power plant and aircraft controls have been greatly improved through human factors work, similar detailed attention to computational systems will help to minimize these risks.
Third, there are tremendous benefits to be gained through improving system administration user interfaces. Beyond the decrease in cost and risk, there is an opportunity to increase the rate of deployment of beneficial computer services as the process of taking systems from developer to operation is simplified. Furthermore, more reliable base services allow the construction of more complex and capable superstructures. In many cases, it is the human component that acts as the bottleneck in getting useful services to end users.
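The cost figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the cited studies: the function names and the four-hour outage are illustrative assumptions, while the one-third and two-thirds shares and the $225,000 per hour figure come from the text.

```python
from fractions import Fraction

# Figure quoted in the text for a world-scale auction web site [8].
DOWNTIME_COST_PER_HOUR = 225_000  # US dollars


def human_share_change(before: Fraction, after: Fraction) -> dict:
    """Compare the human share of total operating cost at two points in time.

    `before` and `after` are the fractions of total spending that go to
    human operators; the remainder is assumed to go to technology.
    """
    return {
        # How much the human share of the total budget grew.
        "share_growth": after / before,
        # Dollars spent on people per dollar spent on technology.
        "human_per_tech_before": before / (1 - before),
        "human_per_tech_after": after / (1 - after),
    }


def downtime_cost(hours: float) -> float:
    """Estimated loss for an outage of the given length (hypothetical helper)."""
    return DOWNTIME_COST_PER_HOUR * hours


change = human_share_change(Fraction(1, 3), Fraction(2, 3))
print(change["share_growth"])           # 2: the human share has doubled
print(change["human_per_tech_before"])  # 1/2
print(change["human_per_tech_after"])   # 2
print(downtime_cost(4))                 # 900000 for an assumed four-hour outage
```

Under these assumptions the human share of the budget doubles, while spending on people per dollar of technology rises from one half to two, which is one way to see why administration now dominates the cost picture.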
Range of Issues
The task of system administration is greatly varied, and a number of issues fit within the purview of this workshop. Note that the focus will be more on the human-computer interaction of actually running systems than on the human-human interaction between administrators and end users. Some examples of issues of interest:
Management of Scale and Diversity includes a range of issues. Scale problems begin with the prediction of the quantity of computational power that will be needed for the solution. Then when the system is deployed, there may be hundreds or even thousands of replicated components constituting the final system. These components must be managed, monitored, and repaired as things go wrong. User interfaces for large-scale systems must scale with the size and complexity of the system. Diversity of hardware, software, and interconnections also continues to grow, greatly increasing the required knowledge for administrators.
System Monitoring and Notification is the process of rendering the state of a complex system into a form that is amenable to human comprehension. It also involves occupying a suitable portion of the operator’s attention based on the state of the system (e.g., operating normally vs. melting down) and the operator’s interest. Monitoring issues involve collecting information from diverse and distributed components and then providing a suitable visualization of the composite information.
Problem Solving aims at keeping systems running by decreasing the mean time to recovery (MTTR) after a problem occurs. Recovery involves observing the problem, determining its cause, determining the remedy, and implementing the remedy. Problems include hardware and software errors, as well as human errors.
There are obviously many other issues involved in the management of systems, but this section gives a flavor of the diversity.
Goal
The goal of this workshop is to foster interest within the CHI community in the problem of designing workspaces for those who manage Internet-scale computing systems. Though there is considerable overlap with traditional HCI work, the problems faced by system administrators are also unusual because of the complex systems they manage and the power-user skills of the administrators. Devoting energy to this effort is important and timely because system administrators are increasingly critical for the operation of the systems upon which our networked world relies. Furthermore, there is an increasing disparity between the complexity of the task they face and the relatively unsophisticated tools that predominate in their workspace (e.g., telnet and email). The issue is also timely because of the recent recognition of the high people-cost of operating large computing systems [4].
A further goal of the workshop is to connect the research and development community with real-world practitioners. It is an unfortunate reality that most middleware systems and their user interfaces are designed by people who have little idea what an actual large-scale deployment feels like. Both designers and users will benefit from the exchange of ideas and experiences.
FORMAT OF THE WORKSHOP
Participation Solicitation
Because this topic is fairly new to the CHI community, the program committee will advertise the workshop beyond the normal CHI channels. The call will be distributed to related SIGs, such as SIGCPR/SIGMIS, and mailing lists for related groups, such as USENIX LISA and the System Administrators Guild (SAGE). Contributions will also be solicited from field practitioners through user groups for database administrators, system administrators, web server administrators, and similar roles. Finally, input will be solicited from industry, both from developers of system software and from services organizations that manage large computational systems on behalf of their customers.
Submission Requirements
Applicants will be required to submit a position paper. This paper will be limited to 3 pages and should include:
(a) A description of a particular system administration problem that fits within the range of topics of the workshop. The problem definition should be limited and precise enough that different solution approaches can be generated and compared.
(b) An analysis of the problem and/or an approach to solving it.
(c) A short biography followed by a description of your current discipline and reason for interest in this topic. It should also include a list of relevant writings on the topic (URLs preferred) and at least one pointer to someone else’s related work that you find interesting.
Evaluation Criteria
The number of participants will be limited to 15, preferably five each of developers, researchers, and system administrators. Because this is a cross-disciplinary and formational meeting, the program committee will choose participants so that skills and experience are complementary. The selection of participants and the deciding of panel topics will be done together to form a coherent group. Since the position papers will not be presented orally at the workshop, the relevance, insightfulness and quality of the written paper will be a strong factor in evaluating submissions.
Pre-Workshop Activities
All participants will be required to generate a three-minute videotape of a real-world system administration activity; both positive and negative examples will be appreciated. Participants will also be encouraged to submit “Hall of Fame” and “Hall of Shame” anecdotes about the management of systems. A website will be set up to share pre-workshop input (footnote 1).
The program committee will formulate discussion topics based on the position papers and biographies of the participants. These topics will be distributed to the participants so that they will be prepared for the workshop panel discussions. Position papers will not be presented orally at the workshop, so all participants will be expected to read all of the other position papers.
Workshop Schedule
The one-day workshop of six working hours will be broken down into four sessions, two in the morning and two in the afternoon. The first three will follow the same format with each session focused on the perspective of one of the three participant types. The final session will be a cross-disciplinary design session where the varying viewpoints of the different participants will contribute to addressing a particular problem. The goal of the final session will be to outline a joint paper and distribute the work for writing it.
Session I: Life in the Wild – System Administrators
• Summary of Position Papers by Moderator [15 min]
• Videotapes [25 min (five at 5 each)]
• Panel on Selected Topic [30 min (five at 6 each)]
  Example topics: “A Day in the Life”, “Monitoring Complex Systems”, “When a Problem Strikes”
• Discussion [20 min]
Session II: Tools for Managing – Developers
• Summary of Position Papers by Moderator [15 min]
• Videotapes [25 min (five at 5 each)]
• Panel on Selected Topic [30 min (five at 6 each)]
  Example topics: “The Role of Standards”, “Coordinating Heterogeneous Systems”, “The Thought Behind our UI Design”
• Discussion [20 min]
1 http://www.cs.berkeley.edu/~mikechen/chi2003-sysadmin/
Session III: What is Going On? – Researchers
• Summary of Position Papers by Moderator [15 min]
• Videotapes [25 min (five at 5 each)]
• Panel on Selected Topic [30 min (five at 6 each)]
  Example topics: “The Fruit of Ethnography”, “Collaboration in Problem Solving”, “Traditional WIMPs: Pros & Cons”
• Discussion [20 min]
Session IV: Cross-Disciplinary Design Session
• Refining the Topic [30 min]
• The Three Perspectives [30 min]
• Closing Panel [30 min]
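As a consistency check on the schedule, the item durations listed for the four sessions can be summed and compared with the six working hours stated for the workshop. This is a minimal sketch: the durations are transcribed from the session outlines above, and the dictionary and function names are illustrative assumptions.

```python
# Agenda shared by Sessions I-III; durations are in minutes.
STANDARD_SESSION = {
    "Summary of Position Papers by Moderator": 15,
    "Videotapes": 5 * 5,               # five videotapes at 5 minutes each
    "Panel on Selected Topic": 5 * 6,  # five panelists at 6 minutes each
    "Discussion": 20,
}

# Agenda for Session IV, the cross-disciplinary design session.
DESIGN_SESSION = {
    "Refining the Topic": 30,
    "The Three Perspectives": 30,
    "Closing Panel": 30,
}

SESSIONS = {
    "Session I": STANDARD_SESSION,
    "Session II": STANDARD_SESSION,
    "Session III": STANDARD_SESSION,
    "Session IV": DESIGN_SESSION,
}


def session_minutes(agenda: dict) -> int:
    """Total length of one session in minutes."""
    return sum(agenda.values())


def total_hours(sessions: dict) -> float:
    """Total working time across all sessions, in hours."""
    return sum(session_minutes(a) for a in sessions.values()) / 60


for name, agenda in SESSIONS.items():
    print(name, session_minutes(agenda), "min")  # each session is 90 min
print(total_hours(SESSIONS), "hours")            # 6.0 hours
```

Each session comes to 90 minutes, so the four sessions together fill the six working hours, leaving breaks and lunch outside the counted time.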
Poster
The organizers plan to generate a poster from the workshop, including both lessons learned and the cross-disciplinary design results.
Plan for Dissemination
We hope to generate publication-quality papers from the participants for collection into a special issue of a journal.
Special Requests to the Workshop Chairs
We request that the CHI conference fee be waived for participants who only wish to attend the workshop. Because we will be soliciting participation by people who are not traditionally part of the CHI community, we hope to make it possible for them to join the workshop without having to pay for or attend the whole conference.
We request the use of a videotape player and television or projector for use during the workshop so that participants can share the videos they have prepared. We also request the use of a computer projector for presentations.
ORGANIZERS’ BACKGROUNDS
Rob Barrett and Michael Chen are the chief organizers of this workshop and are the ones for whom the fees should be waived. The other organizers listed have agreed to form a program committee and to participate in the workshop as needed. Because this field is at an early stage, we sought a critical mass of participation from the outset.
Rob Barrett is a Research Staff Member at the IBM Almaden Research Center in California where he works on systems within the domain of HCI. Previous work presented at SIGCHI includes an intermediary approach to designing web applications and transfer functions for pointing devices. His current work involves quantifying system administrator work through instrumenting user interfaces, and the role of humans in large-scale autonomic computing systems. He has over 40 publications and 12 patents in fields ranging from applied math to physics and computer science.
Michael Chen is a Graduate Student Researcher at the University of California at Berkeley with a background in Internet systems and HCI. His research has focused on search engine interfaces and improving the manageability of Internet systems. In particular, his work on Recovery Oriented Computing involves path-based visualization of dynamic system state and automating failure detection and recovery.
Paul Maglio is a cognitive scientist and manager at the IBM Almaden Research Center. His research has focused on user mental models of information systems and tricks people use to simplify the cognitive problems they solve. He has published widely in cognitive science, computer science, and human-computer interaction. Paul's research group is actively involved in the development of system administration interfaces for several key IBM middleware products. He has participated in several CHI workshops.
Aaron Brown is a Graduate Student Researcher at the University of California at Berkeley and a member of the Recovery Oriented Computing (ROC) project. Combining his background in traditional system architecture and operating system design with a strong interest in HCI, his research is on designing server-class systems that provide forgiving environments for their system administrators. In particular, Aaron's research is focused on designing and evaluating mechanisms that let systems mitigate, rather than suppress, human operator error, encouraging fast recovery from mistakes and providing a safe space for exploration.
Robert Uthe is the User Interface Architect at Tivoli Systems, an IBM company focusing on systems and network management. His earlier work includes products to visualize and manage a computer network and the interconnected applications that make up key business processes within an enterprise. His current work concerns web consoles, common UI guidelines and tooling, user-centered design methodology, and linkages between HCI and development. He has performed on the order of 500 demos to solicit input on products and gain a better understanding of his users. He has presented at numerous conferences, including JavaOne, Share, and Tivoli’s Planet Tivoli.
Mark Verber is a director of network operations at Tellme Networks. He has twenty years of experience designing, building, and operating large, highly complex production services. His current focus is on lowering the total cost of ownership of the Tellme service while improving reliability, through systematic instrumentation and improvement in system design, processes and tools. Mark has been on the USENIX LISA program committee three times, has been a reader and paper mentor for eight conferences, helped run the advanced topics in system administration workshop, and helped found the System Administrators Guild (SAGE).
REFERENCES
1. Gartner Group/Dataquest. Server Storage and RAID Worldwide (May 1999).
2. Gelb, J.P. System-managed storage. IBM Systems Journal 28, 1 (1989), 77-103.
3. Gray, J. Dependability in the Internet Era. Available at http://research.microsoft.com/~gray/talks/InternetAvailability.ppt.
4. IBM. Autonomic Computing: IBM’s Perspective on the State of Information Technology. Available at http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf.
5. ITCentrix. Storage on Tap: Understanding the Business Value of Storage Service Providers (March 2001).
6. Hennessy, J.L., and Patterson, D.A. Computer Architecture: A Quantitative Approach (3rd edition). Morgan Kaufmann, San Francisco, 2002, chapter 8.12.
7. Oppenheimer, D., and Patterson, D.A. Architecture, operation, and dependability of large-scale Internet services: three case studies. Submission to IEEE Internet Computing special issue on Global Deployment of Data Centers, February 2002.
8. Patterson, D.A. A simple way to estimate the cost of downtime. Proc 16th Systems Administration Conference – LISA 2002 (Philadelphia PA, Nov 2002).