NETWORKING SUPPORT FOR HIGH-PERFORMANCE SERVERS
A Dissertation Presented by ERICH M. NAHUM
Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY February 1997 Department of Computer Science
© Copyright by Erich M. Nahum 1997 All Rights Reserved
NETWORKING SUPPORT FOR HIGH-PERFORMANCE SERVERS
A Dissertation Presented by ERICH M. NAHUM
Approved as to style and content by: James F. Kurose, Co-chair Donald F. Towsley, Co-chair J. Eliot B. Moss, Member C. Mani Krishna, Member Larry L. Peterson, Member David W. Stemple, Department Chair Department of Computer Science
ACKNOWLEDGEMENTS

Despite its definition, a doctoral dissertation is never really the product of a single person. A better analogy is the surgical team which Brooks presents in The Mythical Man-Month [23] as a model for software development. A head surgeon leads the team, and personally performs the operation, but is surrounded and assisted by a group of supporters and specialists that enable the procedure. Without the team, the operation would be impossible. A dissertation could be thought of as an operation, with the author as head surgeon. In this case, my surgical team has been large indeed. It has been my privilege and pleasure to have Jim Kurose and Don Towsley as my advisors and thesis co-chairs. They have provided me with an exemplar of what research is and how it is done. They allowed me incredible freedom in my research, yet remained involved and kept me focused. Jim has been a tireless teacher, constantly impressing me with his grasp of the big picture and how to communicate it. Don has never ceased to amaze me with his ability to rapidly absorb technical challenges and to get to the heart of them quickly. The two of them make an incredible team, and have shaped me in ways that I cannot articulate with any justice. I am deeply indebted to them for their time and effort in guiding my research. Eliot Moss has also contributed to my graduate career, not only as a committee member, but also as a teacher and all-around excellent systems person. I gratefully acknowledge his insight and impact. Mani Krishna also was a great help in my thesis, and I thank him for serving on my committee. Larry Peterson I must thank for many things. Despite the inconvenience, he agreed to be on my thesis committee and serve as a reference for my job search. He also was a gracious and helpful host, allowing
me to come visit and work with him twice in Arizona. He too has been a model of how outstanding research is done. If anyone could also be considered a committee member, it would be my close friend and colleague, David Yates. Dave and I have worked closely together practically since arriving at UMass, and without him, graduate school would have been much more difficult and much less pleasant. I thank him for all the time he has spent working with me and for the innumerable discussions we have had. Together, we have learned a great deal and consumed a lot of Chinese food and ice cream. I will miss working this closely with him. Other faculty at UMass have also contributed to my graduate education in ways both direct and indirect. Jack Stankovic and Krithi Ramamritham supported and advised me in the early stages of my degree. Connie Wogrin has been a continual source of help and encouragement. Kathryn McKinley is an inspiration as to what a new faculty member should be, and Jack Wileden played a key role in allowing me to attend UMass. Other students have also contributed to my experiences in myriad ways. In addition to Dave Yates, Eric Brown, Amer Diwan, and Ben Hurwitz all suffered through comprehensive exams with me, and over the years have been sources of friendship as well as technical expertise. The environment in the networks research lab has always been one of friendly interaction. I thank the various members, past and present, for numerous conversations and discourse: Supratik Bhattacharyya, Shenze Chen, Jayanta Dey, Victor Firoiu, Timur Friedman, Ren-Hung Huang, Sue Bok Moon, Ramesh Nagarajan, Sridar Pingali, Ramjee Ramachandran, Jim Salehi, Henning Schulzrinne, Maya Yajnik, and Zhili Zhang. Henning was also a great source of advice during the job search season.
The Object Systems lab was another source of information, expertise, and debate. John Cavazos, Tony Hosking, Farshad Nayeri, John Ridgway, Darko Stefanovic, and Norm Walsh all contributed to my education. Other students at UMass were also positive factors: Panos Chrysanthis, Jay Corbett, Matt Dwyer, Al Kaplan, Cristobal Pedregal-Martin, and Al Hough (after whom the infamous Al Hough Test is named). UMass students were not the only ones to add to my experiences. Many at other schools did as well, such as Mark Abbott, Mats Bjorkman, Lawrence Brakmo, Ed Bugnion, Ramon Caceres, Mark Crovella, Peter Druschel, Dawson Engler, David Mosberger-Tang, Ed Menze, Jeff Hollingsworth, Steve Hotz, Joe Hummel, Doug Schmidt, Dean Tullsen and Charlie Turner. Mark Crovella and Peter Druschel have been especially supportive, particularly during the job search process. Faculty and scientists at other institutions provided technical expertise as well: Norm Hutchinson, Hilarie Orman, Sean O'Malley, Rich Schroepel, Joe Touch, and Mendel Rosenblum. Several people at Silicon Graphics gave excellent advice and critique: Greg Chesson, Bill Fisher, Neal Nuckolls, and Jack Veenstra. Special thanks to Jack for MINT, without which the cache work would have been impossible. Franklin Reynolds and Franco Travostino at OSF also gave useful feedback. I am also indebted to two instrumental people in the department: Betty Hardy and Sharon Mallory, who make so many things happen. I'd like to thank Larry Davis at Maryland's UMIACS for the ARPA Fellowship in High Performance Computing, and Linda Wright at DEC for the CMG Fellowship. These awards allowed me great research and travel opportunities and freed me from worrying about funding. The foundation for my graduate work was laid by my undergraduate experience at the University of Wisconsin. I deeply thank Brigitte Jirku and Bart Miller, without whom I would have never made it to graduate school. Bart is another model
researcher, and has remained a source of excellent advice and support throughout graduate school. I would like to also thank my many friends during my tenure at UMass, who were a great source of warmth and emotional support: Christine Arnold, Dan Carter, Jeremiah Cohen, Susan Cornelliussen, Cheryl Hall, Josh Hawley, Raghavan Manmatha, Laurie McLary, Amanda Shaw, Deborah Stolo, and Wendy Welsh. Finally, for their support, I'd like to thank my family: Mom, Dad, Carol, and Jed.
ABSTRACT

NETWORKING SUPPORT FOR HIGH-PERFORMANCE SERVERS

FEBRUARY 1997

ERICH M. NAHUM
B.S., UNIVERSITY OF WISCONSIN-MADISON
M.S., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professors James F. Kurose and Donald F. Towsley

Networked information systems have seen explosive growth in the last few years, and are transforming society both economically and socially. The information available via the global information infrastructure is growing rapidly, dramatically increasing the performance requirements for large scale information servers. Example services include digital libraries, video-on-demand, World-Wide Web and high-performance file systems. In this dissertation, we investigate performance issues that affect networking support for high-performance servers. We focus on three research issues:
Parallelism Using Packets. The first part of this dissertation identifies performance issues of network protocol processing on shared-memory multiprocessors when packets are used as the unit of concurrency. Our results show good available parallelism for connectionless protocols such as UDP, but limited speedup using TCP within a single connection. However, with multiple connections, parallelism is improved. We demonstrate how locking structure
impacts performance, and that a complex protocol such as TCP with large connection state yields better speedup with a single lock than with multiple locks. We show how preserving packet order, exploiting cache affinity and avoiding contention affect performance.
Support for Secure Servers. The second part of this dissertation shows how parallelism is an effective means of improving the performance of cryptographic protocols. We demonstrate excellent available parallelism by showing linear speedup with several Internet-based cryptographic protocol stacks, using packet-level parallelism. We also show linear speedup using another approach to parallelism, where connections are the unit of concurrency.
Cache Memory Behavior. In the final part of this dissertation we present a performance study of memory reference behavior in network protocol processing. We show that network protocol memory reference behavior varies widely. We find that instruction cache behavior is the primary contributor to protocol performance under most scenarios, and we investigate the impact of architectural features such as associativity and larger cache sizes. We explore these issues in the context of the network subsystem, i.e., the protocol stack, examining throughput, latency, and scalability.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

Chapter

1. INTRODUCTION
   1.1 Motivation
      1.1.1 Packet-Level Parallelism
      1.1.2 Support for Secure Servers
      1.1.3 Cache Behavior of Network Protocols
   1.2 Contributions of this Dissertation
   1.3 Structure of this Dissertation
2. PARALLELISM USING PACKETS
   2.1 Introduction
   2.2 Survey of Related Work
   2.3 Research Issues in Packet-Level Parallelism
   2.4 Experimental Description
      2.4.1 Packet-Level Parallel x-kernel
      2.4.2 Protocols
      2.4.3 In-Memory Drivers
   2.5 Baseline Results
      2.5.1 Send and Receive Side Processing
      2.5.2 Checksumming and Packet Size
   2.6 Ordering Issues
      2.6.1 Ordering Issues in TCP
      2.6.2 Ordering and Correctness
      2.6.3 Multiple Connections
   2.7 Locking Issues
      2.7.1 Locking Granularity in TCP
      2.7.2 Atomic Increment and Decrement
   2.8 Per-Processor Resource Caching
   2.9 Architectural Trends
   2.10 20 Processor Results
   2.11 Conclusions
3. SUPPORT FOR SECURE SERVERS
   3.1 Introduction
   3.2 Survey of Related Work
   3.3 Research Issues
   3.4 Protocols Used
   3.5 Parallel Infrastructure
   3.6 Packet-Level Parallel Results
   3.7 Connection-Level Parallelism
   3.8 Discussion
   3.9 Conclusions
4. CACHE BEHAVIOR OF NETWORK PROTOCOLS
   4.1 Introduction
   4.2 Related Work
      4.2.1 Closely Related Work
      4.2.2 Less Closely Related Work
   4.3 Experimental Infrastructure
      4.3.1 Architectural Simulator
      4.3.2 Network Protocol Workload
      4.3.3 Validating the Simulator
   4.4 Characterization and Analysis
      4.4.1 Baseline Memory Analysis
      4.4.2 Impact of Copying Data
      4.4.3 Hot vs. Cold Caches
      4.4.4 Instructions vs. Data
      4.4.5 Instruction Usage
   4.5 Architectural Sensitivity
      4.5.1 Increased Cache Size
      4.5.2 Increased Cache Associativity
      4.5.3 Future Architectures
   4.6 Improving I-Cache Performance with Cord
   4.7 Conclusions and Future Work
5. SUMMARY AND FUTURE WORK
   5.1 Summary of the Dissertation
   5.2 Suggestions for Future Work

APPENDIX: VALIDATING AN ARCHITECTURAL SIMULATOR
   A.1 Introduction
   A.2 Architectural Simulator
      A.2.1 Assumptions
      A.2.2 Modeling Instruction Costs
      A.2.3 Pipelining in the R4000
      A.2.4 Modeling Memory References
      A.2.5 Summary of Validation Sequence
   A.3 Validation Results
   A.4 Sample Output
   A.5 Improving Accuracy
   A.6 Lessons Learned
   A.7 Summary

BIBLIOGRAPHY
LIST OF TABLES

Table

2.1  Percentage of packets out-of-order
3.1  PLP Latency Breakdown (μsec)
3.2  CLP Latency Breakdown (μsec)
4.1  Read and write times in cycles
4.2  Macro benchmark times (sec) and relative error
4.3  Cache Miss Rates for Baseline Protocols
4.4  Cache Miss Rates for Protocols with Copying
4.5  Latencies (in μsec) with Cold, Hot, and Idealized Caches
4.6  Cold Cache Miss Rates
4.7  Miss Rates vs. Cache Size (TCP Send, Cksum Off)
4.8  TCP Miss Rates vs. Associativity (Cksum Off)
4.9  Machine Characteristics
4.10 Machine Latencies (μsec)
4.11 Machine CPIs
4.12 Baseline & CORDed Protocol Latencies (μsec)
4.13 Cache Miss Rates for CORDed Protocols
A.1  Instruction Frequencies
A.2  Instructions that take more than 1 cycle
A.3  Read and write times in cycles
A.4  LMBench real and simulated values in μsec
A.5  Macro benchmark times (sec) and relative error
A.6  Macro benchmark times (sec) and relative error
LIST OF FIGURES

Figure

1.1  Client-Server Relationship
1.2  Internet-Based Protocol Stacks
1.3  Shared-Memory Multiprocessor
2.1  Approaches to Concurrency
2.2  Functional Parallelism
2.3  Data-Level Parallelism
2.4  TCP Send-Side Configuration
2.5  UDP Send Thrpts.
2.6  UDP Send Speedup
2.7  UDP Recv. Thrpts.
2.8  UDP Recv. Speedup
2.9  TCP Send Thrpts.
2.10 TCP Send Speedup
2.11 TCP Recv. Thrpts.
2.12 TCP Recv. Speedup
2.13 TCP Ordering Effects (Checksum On)
2.14 TCP Ordering Effects (Checksum Off)
2.15 Ticketing Effects in TCP
2.16 TCP with Multiple Connections
2.17 TCP Send-Side Locking Comparison
2.18 TCP Receive-Side Locking Comparison
2.19 TCP Atomic Operations Impact
2.20 TCP Message Caching Impact
2.21 TCP Send Thrpts.
2.22 TCP Send Speedups
2.23 TCP Recv. Thrpts.
2.24 TCP Recv. Speedups
2.25 UDP Send Thrpts.
2.26 UDP Send Speedups
2.27 UDP Recv. Thrpts.
2.28 UDP Recv. Speedups
2.29 TCP Send Thrpts.
2.30 TCP Receive Thrpts.
2.31 UDP Send Thrpts.
2.32 UDP Receive Thrpts.
2.33 TCP Locking Comparison
2.34 TCP Multiple Connections
3.1  PLP Send Thrpts.
3.2  PLP Send Speedup
3.3  PLP Receive Thrpts.
3.4  PLP Receive Speedup
3.5  CLP Send Thrpts.
3.6  CLP Send Speedup
3.7  CLP Receive Thrpts.
3.8  CLP Receive Speedup
4.1  Machine Organization
4.2  Actual Read Times
4.3  Simulated Read Times
4.4  Baseline Latencies
4.5  Baseline Percentages
4.6  Copy Protocol Times
4.7  Copy Protocol %
4.8  Send Side Latencies
4.9  TCP Send Side Latencies
4.10 Instruction Usage
4.11 Percentage of Cycles
4.12 Latencies with Cycle Breakdown
4.13 Latencies with Increasing Cache Size
4.14 TCP Send Side Latency with Larger Caches
4.15 Protocol Latencies with Associativity
4.16 TCP Send Side Latency with Associativity
A.1  Actual Read Times
A.2  Simulated Read Times
CHAPTER 1

INTRODUCTION

Networked information systems have seen explosive growth in the last few years, and are transforming society both economically and socially. The information available via the global information infrastructure is growing rapidly as text-only information sources are augmented with voice, video and still image data. In addition, corporations are adopting client/server architectures over traditional mainframe based approaches. Together these factors dramatically increase the performance requirements for large scale information servers. Example services include digital libraries, video-on-demand, World-Wide Web and high-performance file systems. Figure 1.1 illustrates a typical server scenario, with several client machines communicating with a high-performance server over a wide-area network. Clients submit requests for information to the server, which replies with responses. The client-server computing paradigm has seen large-scale adoption in the past few years, due to the simplicity of the model. The client/server model has benefits other than simplicity, including ease of administration, maintenance, and accounting. Client/server computing has disadvantages as well. Its centralized model is vulnerable as a single point of failure, an issue in fault-tolerant systems. One major problem with the client-server model is that of performance. Client/server computing has an asymmetric load distribution for two reasons. First, a server must typically do more work to provide a response than a client must do to make a request. For example, a client simply requests a World-Wide Web Uniform Resource Locator (URL); the server must retrieve a potentially large file from the disk and transmit it
[Figure 1.1: Client-Server Relationship. Clients on a local-area network communicate with a multiprocessor server, which runs the protocol stack, across a wide-area network via a switch/router and a high-speed link.]
over the network. Secondly, and more significantly, a single server provides responses for potentially thousands of clients. In fact, if a server is connected to the global Internet, capacity planners may not even be able to accurately estimate the demand for it, since an unknown (and growing) number of clients may be requesting service. For example, Netscape's home page server currently receives over seventy million requests a day [56]. At the heart of any information server is the network protocol software. While a server can reduce disk I/O requests by use of large disk caches, each client request always requires processing by the server's network subsystem. Thus, the networking performance requirements for servers can be demanding. In this dissertation, we address the needs of high-performance servers, focusing on performance issues in the network protocol software. Different applications have different requirements in terms of the functionality or service that they require from the network. This functionality is divided into conceptual layers, and the functionality at each layer is provided by a protocol. The network subsystem can be thought of as a stack of protocols. Figure 1.2 illustrates a typical Internet-based network protocol stack. We describe specific protocols in more detail in Chapter 2; for now we merely wish to point out that different protocols have different inherent complexities depending on the functionality that they provide. TCP, for example, is a transport protocol that provides connection-oriented,
[Figure 1.2: Internet-Based Protocol Stacks. Applications such as HTTP and NFS run over the TCP and UDP transport protocols, respectively, which in turn run over IP and the network interface.]
reliable, in-order delivery of messages across an unreliable network, typically the Internet. UDP, while also a transport protocol, does not provide the same quality of service as TCP, offering connectionless, best-effort delivery of messages. Thus, both the implementation complexity and dynamic behavior of these two protocols differ markedly. Consequently, their performance is radically different. As discussed earlier, the networking performance requirements for servers can be demanding. One way to address this need is through the use of parallelism. In particular, shared-memory multiprocessors make attractive server platforms. Consider Figure 1.3, which presents a typical shared-memory multiprocessor platform, consisting of several processors connected by a high-speed bus to memory, disk subsystems, and the network. These machines are becoming more common, as shown by recent vendor introductions of platforms such as SGI's Challenge [44], Sun's SPARCCenter [26], and Digital's AlphaServer [1]. Even multiprocessor PCs are now available, with Micron selling complete systems and 4-processor Pentium Pro motherboards available from Intel. The spread of these machines results from a number of factors: binary compatibility with lower-end workstations, good price/performance relative to high-end machines such as Crays, and ease of programming compared to more
[Figure 1.3: Shared-Memory Multiprocessor. N CPUs, each with on-chip instruction and data caches and an off-chip unified cache, are connected by a shared-memory bus to memory and a high-speed network device.]
elaborate parallel machines such as Hypercubes. Probably the greatest factor is one of economics: shared-memory multiprocessors built using microprocessor parts allow the manufacturer to leverage uniprocessor development costs. Given their utility as networked servers, a significant research problem is to determine the appropriate networking support on shared-memory multiprocessors: to achieve scalability, avoid bottlenecks and exploit the full capabilities of these machines. In this dissertation, we explore issues that impact performance of network protocols on shared-memory multiprocessors. Another feature of high-performance servers is the presence of caches, to reduce the processor's demands on the bus and the memory system for instructions and data. Caches are local memories that hold copies of instructions and data from main memory, but can be accessed orders of magnitude more quickly. Cache memories store the subset of instructions and data that a processor is most likely to use, exploiting the principles of temporal locality and spatial locality in programs. Temporal locality is the property that, if a program refers to a memory location once, it is likely to refer to that same location again in the near future. Spatial locality is the property that, if a program refers to a location in memory, there is a high likelihood
that the program will refer to another piece of data physically near that location. Computer systems exploit these properties in programs by using cache memories, and in fact multiple layers of cache memories are typical of computer systems today. Referring again to Figure 1.3, we see that each processor has three caches associated with it: two on-chip caches that hold instructions and data, respectively, and one off-chip but on-board cache that holds both instructions and data. Given the presence of these architectural features on both uniprocessor and multiprocessor servers, it is important to understand the cache behavior of network protocols. In this dissertation, we examine the interaction of caches and network protocol software. This dissertation thus studies the performance issues that affect network support on high-performance servers. We focus on three separate problems in this context:
Packet-Level Parallelism. We study the available parallelism in network protocols when messages or packets are used as the unit of concurrency. We present a packet-level parallel implementation of a core TCP/IP protocol stack, identify several performance issues, and evaluate the available parallelism under several dierent scenarios.
Supporting Secure Servers. We study the available parallelism for a different class of protocols, namely, security or cryptographic protocols. We evaluate available parallelism for several different cryptographic protocol stacks using packet-level parallelism, and another approach to concurrency, connection-level parallelism.
Cache Behavior of Network Protocols. We characterize the cache behavior of a uniprocessor protocol stack, determining statistics such as cache miss rates and percentage of time spent waiting for memory. We evaluate the sensitivity of network protocols to the host architecture, varying factors such as cache size and associativity.
We now introduce these issues, and motivate why each of them is a factor for high-performance servers.
1.1 Motivation

1.1.1 Packet-Level Parallelism

Concurrency is required in the network protocol stack; otherwise, a server's network bandwidth will be limited by the performance of a single processor, which may become a bottleneck. Many approaches to concurrency in protocols have been proposed. One that has gained favor is packet-level parallelism [12, 48], where packets or messages are the unit of concurrency. Packets are processed in parallel, regardless of their connection or where they are in the protocol stack. This approach appears able to adapt to the workload more readily than other approaches to parallelism in protocols. We present a taxonomy of these approaches in more detail in Chapter 2.
1.1.2 Support for Secure Servers

Security and privacy are growing concerns in the Internet community, because of the Internet's rapid growth and the desire to conduct business over it safely. This desire has led to the advent of several proposals for security standards, such as secure IP [6], Secure HTTP (SHTTP) [101], and the Secure Socket Layer (SSL) [53]. The intent of these protocols is to make communication over the Internet private and secure, to enable electronic commerce. Thus, the use of encryption protocols such as DES and RSA is increasing. Cryptographic protocols are by necessity extremely compute-intensive. Given the asymmetry in client/server computing, the increased use of these protocols places even greater burdens on servers. Thus, any support that can be given to servers that transfer a large amount of secure information will be well-utilized.
1.1.3 Cache Behavior of Network Protocols

The large gap between CPU speeds and memory speeds is well-known, and is expected to continue for the foreseeable future [51]. Cache memories are used to bridge this gap, and multiple levels of cache memories are now typical. Thus, cache behavior is a central issue in contemporary computer system performance. Many studies have examined the memory reference behavior of application code, and recently work has appeared examining the cache behavior of operating systems. However, little work has been done to date examining the impact of memory reference behavior on network protocols. As networks become ubiquitous, it is important to understand the interaction of network protocol software and computer hardware. Thus, rather than examining the caching behavior of an application suite such as the SPEC 95 benchmarks, the workload that we study is network protocol software.
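As a hypothetical illustration of the two locality principles introduced earlier (this is not code from the dissertation's workload), the small C fragment below touches memory the way much protocol data-touching code does: the accumulator is reused on every iteration (temporal locality), and the buffer is walked sequentially, so each cache-line fill serves many subsequent reads (spatial locality).

    #include <stddef.h>
    #include <stdint.h>

    /* Sum a packet buffer one byte at a time.  The accumulator "sum" is
     * reused on every iteration (temporal locality), and buf[i] walks
     * consecutive addresses, so one cache-line fill of buf services
     * many subsequent reads (spatial locality). */
    uint32_t sum_buffer(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;               /* hot: stays in a register or cache */
        for (size_t i = 0; i < len; i++)
            sum += buf[i];              /* sequential, cache-line friendly   */
        return sum;
    }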
1.2 Contributions of this Dissertation

The following are the contributions of this dissertation:
Parallelism Using Packets. The first part of this dissertation identifies significant performance issues of network protocol processing on shared-memory multiprocessors using packet-level parallelism. Our results show good available parallelism for connectionless protocols such as UDP, but limited speedup using TCP within a single connection. However, with multiple connections the amount of available parallelism increases. We find that packet ordering plays a key role in determining single-connection TCP performance. We show how locking structure impacts performance, and that a complex protocol such as TCP with large connection state yields better speedup with a single lock than with multiple locks. We find that exploiting cache affinity and avoiding contention yields significant performance benefits.
Support for Secure Servers. The second part of this dissertation shows how parallelism is an effective means of improving the performance of cryptographic protocols. Although limited parallelism is available in protocol processing within a single connection in a baseline TCP/IP stack, when application or other protocol processing is taken into account, the available parallelism can increase remarkably. A good example of this is found using network security protocols. Since cryptographic protocols are compute-bound, they are natural candidates for parallelization. We demonstrate excellent available parallelism by showing linear speedup with several Internet-based cryptographic protocol stacks, using packet-level parallelism. We also show linear speedup using another approach to parallelism, where connections are the unit of concurrency.
Cache Memory Behavior. The third part of this dissertation presents a performance study of memory reference behavior in network protocol processing. We show that network protocol cache behavior varies widely, with miss rates ranging from 0 to 28 percent, depending on the scenario. We find instruction cache behavior is the primary contributor to protocol latency under most cases, and that cold cache behavior is very different from warm cache behavior. We demonstrate the upper bounds to performance that can be expected by improving memory behavior, and evaluate the impact of architectural changes such as an increase in associativity and larger cache sizes. In particular, we find that TCP is more sensitive to cache behavior than UDP, gaining larger benefits from improved associativity and bigger caches. We show that network protocols are well-suited to RISC architectures, and that network protocols should scale well with CPU speeds in the future. All of these issues are explored in the context of the network subsystem, i.e., the protocol stack, examining throughput (as seen by an application), latency (as observed
by an individual message), and scalability (in terms of the available concurrency in the system).
1.3 Structure of this Dissertation

In this introduction, we have provided the context and motivation for three problems in the area of networking support for high-performance servers. In the next three chapters we examine each of the three problems in detail. In each chapter we include a survey of related work as appropriate. In Chapter 2 we outline approaches to parallelism in network protocols, and describe our implementation of packet-level parallelism in detail. We discuss our experimental environment and the protocols used in our studies. We present experimental results, showing that the available parallelism varies depending on the protocols and number of connections used. In Chapter 3 we show how parallelism can be used to improve cryptographic protocol performance. We discuss issues in software implementation of security protocols, describe the protocols that we use, and present experimental results. We demonstrate linear speedup using both packet-level and connection-level parallelism. In Chapter 4 we examine the memory reference behavior of network protocols via execution-driven simulation. We describe our architectural simulator, and quantify the accuracy of the simulator through validation. We characterize and analyze network protocol software cache behavior, and show sensitivity to the architectural environment. Finally, in Chapter 5 we summarize the dissertation and suggest several possible avenues for future work.
CHAPTER 2

PARALLELISM USING PACKETS

2.1 Introduction

As we discussed in Chapter 1, shared-memory multiprocessors are attractive server platforms. A significant research problem is to determine the appropriate networking support for such shared-memory multiprocessors that can avoid bottlenecks and take advantage of the machines' full capabilities. One way to improve performance in the network protocol subsystem is to exploit the availability of multiple processors in the host. The use of parallelism in network protocol processing has recently become an active area of research in both academia [5, 12, 20, 21, 22, 35, 46, 47, 48, 59, 62, 63, 64, 69, 70, 71, 72, 74, 75, 76, 77, 87, 88, 89, 95, 100, 106, 107, 108, 109, 111, 112, 116, 117, 124, 125] and industry [18, 37, 42, 45, 49, 68, 90, 94, 110, 120]. Many approaches to parallelism in network protocols have been proposed. We provide a brief taxonomy of parallelism in protocols here; more detailed surveys can be found in [12, 48]. In general, we attempt to classify approaches by the unit of concurrency, or what it is that processing elements do in parallel. Here a processing element is a locus of execution for protocol processing, and can be a dedicated processor, a heavyweight process, or a lightweight thread. Figure 2.1 illustrates the various approaches to concurrency in host protocol processing. The dashed ovals represent processing elements. The numbers found in the lower right hand corner of the packet indicate which connection the packet is associated with.
[Figure 2.1: Approaches to Concurrency. Three panels show layer parallelism, connection-level parallelism, and packet-level parallelism for a TCP/IP/FDDI stack handling packets from two connections; dashed ovals denote processing elements (threads), boxes denote protocols, and numbered squares denote packets flowing through the stack.]
In layer parallelism, each protocol layer is a unit of concurrency. Specific layers are assigned to processing elements, and messages are passed between protocols through interprocess communication. The main advantage of layer parallelism is that it is simple and defines a clean separation between protocol boundaries. The disadvantages are that concurrency is limited to the number of layers in the stack, and that large amounts of context switching and synchronization between layers occur, reducing performance [29, 30, 111]. Since an individual message is processed by at most one processor, performance improvements are limited to aggregate throughput, by concurrently processing different messages in different protocol layers, in a form of pipelining. An example of this approach is found in [46]. Connections form the unit of concurrency in connection-level parallelism, where different processing elements are associated with different connections. Speedup is achieved with multiple connections, which are processed concurrently. The advantage to this approach is its simplicity and that it exploits the natural concurrency among connections. Locking is kept to a minimum along the "fast path" of data transfer,
[Figure 2.2: Functional Parallelism. A TCP packet header (source and destination ports, sequence number, acknowledgement number, offset, status bits, window, checksum, urgent pointer, options, padding, and data) is annotated with the per-layer functions that could be decomposed and run concurrently: source lookup, destination lookup, sequence check, ACK check, checksum verification, options processing, and window check.]
since only one lock must be acquired, namely, that for the appropriate connection. The disadvantage with connection-level parallelism is that no concurrency within a single connection can be achieved. This may be a problem if traffic exhibits locality [60, 73, 81, 93], i.e., is bursty. Systems using this approach include [100, 110, 111, 123]. In packet-level parallelism, packets are the unit of concurrency. Sometimes referred to as thread-per-packet or processor-per-message, packet-level parallelism assigns each packet or message to a single processing element. The advantage of this approach is that packets are processed regardless of the connection with which they are associated, or the layer in the stack in which they are present, achieving speedup both with multiple connections and within a single connection. The disadvantage is that it requires locking shared state, most significantly the protocol state at each layer. Systems using this approach include [12, 48]. In functional parallelism, a protocol layer's functions are the unit of concurrency. Functions within a single protocol layer (e.g., checksum, ACK generation) are decomposed, and each assigned to a processing element. Figure 2.2 shows a hypothetical use of functional parallelism in the context of TCP. The advantage to this approach is that it is relatively fine-grained, and thus can improve the latency of an individual
[Figure 2.3: Data-Level Parallelism. A single packet's data is divided into pieces, and the checksum calculation over each piece is assigned to a different processing element.]
message as well as aggregate throughput. The disadvantage is that it requires synchronizing within a protocol layer, and is dependent upon the concurrency available between the functions of a particular layer. Examples include [70, 71, 88]. In data-level parallelism, data are the units of concurrency, analogous to SIMD processing. Processing elements are assigned to the same function of a particular layer, but process separate pieces of data from the same message. An example is illustrated in Figure 2.3, where a single message's checksum is computed using multiple processors. The appropriate grain size of a piece of data depends both on the application and on architectural factors (e.g., the cache line size). The advantage to this approach is that it is the most fine-grained, and thus has the potential for the greatest improvement in both throughput and latency. The disadvantage is that processing elements must synchronize, which may be expensive. We are unaware of any efforts developing this approach to parallelism in protocol processing. The relative merits of one approach over another depend on many factors, including the host architecture, the cost of primitives such as locking and context switching, the workload and number of connections, the thread scheduling policies employed, and
whether the implementations are in hardware or software. Most importantly, they depend on the available concurrency within a protocol stack. In this chapter we address several performance issues in packet-level parallelism. Originally proposed by Hutchinson and Peterson for the x-kernel [58], packet-level parallelism has been advocated as an approach to exploiting parallelism on shared-memory multiprocessors [12, 48, 112].
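To make the data-level decomposition of Figure 2.3 concrete, the fragment below is a minimal sketch (not code from this dissertation) of computing the Internet checksum of one message in parallel: each worker produces a one's-complement partial sum over a contiguous, 16-bit-aligned chunk, and the partial sums are folded together at the end. It assumes POSIX threads, at most 16 workers, and an even number of bytes per chunk; the associativity of one's-complement addition is what makes the decomposition legal.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    struct chunk { const uint16_t *words; size_t nwords; uint32_t partial; };

    /* One's-complement sum over a chunk of 16-bit words (partial checksum). */
    static void *chunk_sum(void *arg)
    {
        struct chunk *c = arg;
        uint32_t sum = 0;
        for (size_t i = 0; i < c->nwords; i++)
            sum += c->words[i];
        while (sum >> 16)                   /* fold carries into low 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        c->partial = sum;
        return NULL;
    }

    /* Hypothetical data-level parallel Internet checksum over an even-length,
     * 16-bit-aligned buffer, split across nthreads workers (1..16). */
    uint16_t parallel_cksum(const uint16_t *buf, size_t nwords, int nthreads)
    {
        pthread_t tid[16];
        struct chunk ch[16];
        size_t per = nwords / nthreads;

        for (int t = 0; t < nthreads; t++) {
            ch[t].words  = buf + (size_t)t * per;
            ch[t].nwords = (t == nthreads - 1) ? nwords - (size_t)t * per : per;
            pthread_create(&tid[t], NULL, chunk_sum, &ch[t]);
        }
        uint32_t sum = 0;
        for (int t = 0; t < nthreads; t++) {
            pthread_join(tid[t], NULL);
            sum += ch[t].partial;           /* partial sums combine associatively */
        }
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;              /* final one's complement */
    }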
2.2 Survey of Related Work

A brief outline of related work was presented in Section 2.1. In this section we discuss in more detail the most closely related work in packet-level parallelism. One of the most relevant pieces of previous work is that of Bjorkman and Gunningberg [12]. They examine packet-level parallelism using the x-kernel on an i386-based Sequent multiprocessor. They show good speedup for UDP, but limited speedup for TCP (on the send side). They do not provide detail on their locking model for TCP, instead showing only a pictorial view. Examining their code, there are 6 global locks that are used by all TCP connections, thus preventing any parallelism between connections. They do not examine TCP receive-side processing, and do not study the impact of ordering, either within TCP or to the application. We find that ordering plays a crucial role in parallelism. The most comprehensive study to date comparing different approaches to parallelism on a shared-memory multiprocessor is by Schmidt and Suda [111, 112]. They show that packet-level parallelism and connection-level parallelism generally perform better than layer parallelism, due to the context-switching overhead incurred crossing protocol boundaries using layer parallelism. In [112], they suggest that packet-level parallelism is preferable when the workload consists of a relatively small number of active connections, and that connection-level parallelism is preferable for large numbers of connections. They only show receive-side results for a single connection in their packet-level parallel implementation, and also ignore ordering issues.
2.3 Research Issues in Packet-Level Parallelism

In this chapter of the dissertation, we determine what opportunities are available in packet-level parallelism, as well as any inherent limitations of this approach to parallelism in network protocols. In particular, we address the following issues:
What is the impact of ordering, both as seen by the application and by a particular protocol layer?
How does the granularity of locking affect performance? What is the parallelism available within a single connection? Within multiple connections?
How does structuring for cache affinity behave? What is the impact of packet size on speedup? The impact of checksumming? In general, we show how these issues influence performance and quantitatively evaluate their impact.
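Before turning to the experimental platform, the fragment below gives a minimal, hypothetical picture of the structure that packet-level parallelism implies; it uses POSIX threads and stub types, and is not the x-kernel code evaluated in this chapter. Each worker processes whole packets to completion, regardless of which connection or layer a packet belongs to; shared per-connection and per-layer state is protected by locks inside the protocols themselves.

    #include <pthread.h>
    #include <stdlib.h>

    /* Minimal stand-ins for x-kernel objects; real code would pull packets
     * from a device driver and hand them to the protocol graph. */
    struct packet { int len; };

    static struct packet *next_packet(void)      /* stub: fabricate a packet */
    {
        return calloc(1, sizeof(struct packet));
    }

    static void protocol_demux(struct packet *p) /* stub: "process" and free */
    {
        free(p);
    }

    /* Packet-level parallelism: every worker handles whole packets,
     * independent of the connection or protocol layer the packet belongs
     * to; per-connection state is locked inside the protocols. */
    static void *packet_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            struct packet *p = next_packet();    /* any connection, any packet */
            protocol_demux(p);                   /* process it to completion   */
        }
        return NULL;
    }

    void start_packet_workers(int nprocessors)   /* one worker per processor */
    {
        for (int i = 0; i < nprocessors; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, packet_worker, NULL);
        }
    }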
2.4 Experimental Description

In order to examine these issues, we have developed a packet-level parallel x-kernel, which runs in user space on Silicon Graphics shared-memory multiprocessors using the IRIX operating system. As such, it is similar in several respects to the platform described by Bjorkman and Gunningberg [11, 12]. Our platform was, for the most part, developed independently, and for a different type of machine. The exception is the SICS MP TCP code, which we used to guide the design of our parallel TCP, as
described in Section 2.7. The SICS platform, however, was based on the February 1992 release of the x-kernel, and ran on the Sequent Symmetry. Our environment is based on the December 1993 x-kernel release, and runs on SGI multiprocessors. Given the differences in hardware, host operating systems, versions of the x-kernel infrastructure and protocols, a direct comparison is thus not possible. Where applicable, however, we describe differences between the systems.
2.4.1 Packet-Level Parallel x-kernel

Our parallelized x-kernel was developed by adding locks into appropriate places in the x-kernel infrastructure. Like the SICS system, we placed locks protecting x-kernel infrastructure within the x-kernel, and placed locks concerning protocols within the protocols. Unlike the SICS system, which used a fixed number of static, global locks, we instantiate locks on a dynamic, per-data-structure basis. The x-kernel's message tool is a facility for managing packet data, analogous to Berkeley mbufs. Messages are per-thread data structures, and thus require no locks. They point to allocated data structures called MNodes which are reference counted; these reference counts must be incremented and decremented atomically. The x-kernel's map manager provides a mapping from an external identifier (e.g., a TCP port number) to an internal identifier (e.g., a TCP protocol control block), using chained-bucket hash tables with a 1-behind cache. Maps have many uses, but are primarily used for demultiplexing. They must be locked to insert, lookup, or remove entries. In addition, since the map manager provides an iterator function mapForEach(), the map manager can call itself recursively. To handle this recursion, counting locks are used, so that if a thread already owns the lock, it simply increments a count and proceeds. Similarly, an unlock decrements the count, and the lock is released when the count reaches zero.
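A minimal sketch of the counting lock just described, written with POSIX threads rather than the SGI primitives the implementation actually used: the owning thread may re-acquire the lock (for example, when mapForEach() calls back into the map manager), and the lock is not released until the count returns to zero.

    #include <pthread.h>

    /* A counting (recursive) lock: the owning thread may acquire it again
     * without deadlocking; it becomes free when the count drops to zero. */
    struct counting_lock {
        pthread_mutex_t mutex;             /* protects owner and count      */
        pthread_cond_t  is_free;           /* signaled when count reaches 0 */
        pthread_t       owner;
        int             count;
    };

    void counting_lock_init(struct counting_lock *l)
    {
        pthread_mutex_init(&l->mutex, NULL);
        pthread_cond_init(&l->is_free, NULL);
        l->count = 0;
    }

    void counting_lock_acquire(struct counting_lock *l)
    {
        pthread_mutex_lock(&l->mutex);
        if (l->count > 0 && pthread_equal(l->owner, pthread_self())) {
            l->count++;                    /* recursive re-entry by the owner */
        } else {
            while (l->count > 0)           /* wait for the current owner      */
                pthread_cond_wait(&l->is_free, &l->mutex);
            l->owner = pthread_self();
            l->count = 1;
        }
        pthread_mutex_unlock(&l->mutex);
    }

    void counting_lock_release(struct counting_lock *l)
    {
        pthread_mutex_lock(&l->mutex);
        if (--l->count == 0)
            pthread_cond_broadcast(&l->is_free);
        pthread_mutex_unlock(&l->mutex);
    }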
The event manager uses a timing wheel [121] to manage events which are to occur in the future. The wheel is essentially another chained-bucket hash table, where the hashing function is based on the time that the event is scheduled to run. To protect this structure, we added per-chain locks, so that concurrent updates to the table were less likely to conflict with one another. Other components of the x-kernel require locks for various reasons, most frequently for atomic addition and subtraction for object reference counts.
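A minimal sketch of a hashed timing wheel with per-chain locks, in the spirit of the event manager described above; this is hypothetical code, not the x-kernel's. Events hash into a slot by their expiry tick, so concurrent registrations contend only when they land on the same chain, and expired events are unlinked under the lock but run after it is released.

    #include <pthread.h>
    #include <stdlib.h>

    #define WHEEL_SLOTS 256                /* one chained bucket per slot */

    struct event {
        unsigned long  expiry_tick;        /* absolute tick when it fires */
        void         (*handler)(void *);
        void          *arg;
        struct event  *next;
    };

    struct wheel_slot {
        pthread_mutex_t lock;              /* per-chain lock, as in the text */
        struct event   *chain;
    };

    static struct wheel_slot wheel[WHEEL_SLOTS];

    void wheel_init(void)                  /* call once before any other use */
    {
        for (int i = 0; i < WHEEL_SLOTS; i++) {
            pthread_mutex_init(&wheel[i].lock, NULL);
            wheel[i].chain = NULL;
        }
    }

    /* Register an event: only the chain it hashes into is locked. */
    void event_register(unsigned long expiry_tick,
                        void (*handler)(void *), void *arg)
    {
        struct event *e = malloc(sizeof(*e));
        e->expiry_tick = expiry_tick;
        e->handler     = handler;
        e->arg         = arg;

        struct wheel_slot *s = &wheel[expiry_tick % WHEEL_SLOTS];
        pthread_mutex_lock(&s->lock);
        e->next  = s->chain;               /* push onto this slot's chain */
        s->chain = e;
        pthread_mutex_unlock(&s->lock);
    }

    /* Called once per clock tick: unlink whatever in this tick's slot has
     * expired, then run the handlers without holding the chain lock. */
    void event_tick(unsigned long now)
    {
        struct wheel_slot *s = &wheel[now % WHEEL_SLOTS];
        struct event *expired = NULL;

        pthread_mutex_lock(&s->lock);
        for (struct event **pp = &s->chain; *pp; ) {
            struct event *e = *pp;
            if (e->expiry_tick <= now) {
                *pp = e->next;             /* unlink this event         */
                e->next = expired;
                expired = e;
            } else {
                pp = &e->next;             /* later round: leave it     */
            }
        }
        pthread_mutex_unlock(&s->lock);

        while (expired) {
            struct event *e = expired;
            expired = e->next;
            e->handler(e->arg);
            free(e);
        }
    }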
2.4.2 Protocols

The protocols we examine are from the core TCP/IP suite, those used in typical Internet scenarios. The execution paths we study are those that would be seen along the common case or "fast path" during data transfer of an application. We do not examine connection setup or teardown; in our experiments, connections are already established. In order to study experimentally various performance-related issues in parallel protocols, we implemented multiprocessor versions of FDDI, IP, UDP, and TCP. Here we describe our parallel implementations of these protocols. The Fiber Distributed Data Interface (FDDI) [43, 105] is a fiber-optic token-ring based LAN protocol. The FDDI protocol in the x-kernel is very simple; it essentially prepends headers to outgoing packets and removes headers from incoming packets. Locking is only necessary in two instances: during session creation and on packet demultiplexing (to determine the upper-layer protocol to which a message should be dispatched). No locking is required for outgoing packets during data transfer. The Internet Protocol (IP) [98] is the network-layer protocol that performs routing of messages over the Internet. IP is structured similarly to FDDI but has a slightly larger amount of state, which must be locked. On the send side, IP has a datagram identifier used for fragmenting packets larger than the network interface
maximum transmission unit (MTU). The identifier must be incremented atomically, per-datagram. On the receive side, if a packet is a fragment, a fragment table must be locked to serialize lookups and updates. The User Datagram Protocol (UDP) [97] is a connectionless datagram transport protocol that provides little beyond simple multiplexing and demultiplexing; it does not make guarantees about ordering, error control, or flow control. Like FDDI, locking is only required for session creation and packet demultiplexing. The Transmission Control Protocol (TCP) [99] is used by reliable Internet applications such as file transfer, remote login, and HTTP. It provides a connection-oriented service with reliable, in-order data delivery, recovers from loss, error, or duplication, and has built-in flow control and congestion control mechanisms. As such, TCP is a much more complex protocol than UDP. Our TCP implementation is based upon the x-kernel's adaptation of the Berkeley Tahoe release, but was updated to be compliant with the BSD Net/2 software distribution. In addition to adding header prediction, this involved updating the congestion control and timer mechanisms, as well as reordering code in the send side to test for the most frequent scenarios first [60]. The one change we made to the base Net/2 structure was to use a 32-bit flow-control window, rather than the 16-bit window defined by the TCP specification. This turns out to be important in order to generate the high bandwidths in our experiments, and we note that 32-bit flow control information is used in both 4.4 BSD with large windows [17] and in the next-generation TCP proposals [16, 119].
TCP, where each version uses a dierent locking granularity. These are described in more detail in Section 2.7. Checksumming has been identi ed as a potential performance issue in TCP/UDP implementations. Certain network interfaces, such as SGI's FDDI boards, have hardware support for calculating checksums that eectively eliminate the checksum performance overhead. However, not all devices have this hardware support. To capture both scenarios, we run experiments with checksumming on and o, to emulate checksums being calculated in software and hardware, respectively. For our software checksum experiments, the checksum code we use is the fastest available portable algorithm that we are aware of, which is from UCSD [66].
2.4.3 In-Memory Drivers

Since our platform runs in user space, accessing the FDDI adaptor involves crossing the IRIX socket layer and the user/kernel boundary, which is prohibitively expensive. Normally, in a user-space implementation of the x-kernel, a simulated device driver is configured below the media access control layer (in this case, FDDI). The simulated driver uses the socket interface to emulate a network device, crossing the user-kernel boundary on every packet. Since we wish to measure only our protocol processing software, we replaced the simulated driver with in-memory device drivers for both the TCP and UDP protocol stacks, in order to avoid this socket-crossing cost. The drivers emulate a high-speed FDDI interface, and support the FDDI maximum transmission unit (MTU) of slightly over 4K bytes. This is similar to approaches taken in [12, 48, 111]. In addition to emulating the actual hardware drivers, the in-memory drivers also simulate the behavior of a peer entity that would be at the remote end of a connection. That is, the drivers act as senders or receivers, producing or consuming packets as quickly as possible, to simulate the behavior of simplex data transfer over an
[Figure 2.4: TCP Send-Side Configuration. A throughput test drives a TCP/IP/FDDI stack; data flows down the stack to an in-memory simulated TCP receiver (SIM-TCP-RECV) below the FDDI layer, which returns ACKs up the stack.]
error-free network. To minimize execution time and experimental perturbation, the receive-side drivers use preconstructed packet templates, and do not calculate TCP or UDP checksums. Instead, in experiments that use a simulated sender, checksums are calculated at the transport layer, but the results are ignored, and assumed correct. Figure 2.4 shows a sample protocol stack, in this case a send-side TCP/IP configuration. In this example, a simulated TCP receiver sits below the FDDI layer. The simulated TCP receiver generates acknowledgment packets for packets sent by the TCP protocol above. The driver acknowledges every other packet, thus mimicking the behavior of Net/2 TCP when communicating with itself as a peer. Since spawning a thread is expensive in user space in IRIX, the driver "borrows" the stack of a calling thread to send an acknowledgment back up. The TCP receive-side driver (i.e., simulated TCP sender) produces packets in-order for consumption by the actual TCP receiver, and flow-controls itself appropriately using the acknowledgments and window information returned by the TCP
receiver. Both simulated TCP drivers also perform their respective roles in setting up a connection.
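As an illustration, the heart of such a simulated receiver might look like the sketch below; the type and function names here are hypothetical stand-ins, not the actual x-kernel driver entry points.

    /* Illustrative sketch of the simulated TCP receiver below FDDI.
       All names (Msg, msg_destroy, build_ack_from_template, deliver_up,
       sim_tcp_recv_push) are hypothetical.  The driver consumes each
       outgoing segment and, borrowing the sending thread's stack, pushes
       a preconstructed ACK back up for every second segment, mimicking
       Net/2 TCP's ACK-every-other-packet behavior. */
    typedef struct Msg Msg;

    extern void msg_destroy(Msg *m);
    extern Msg *build_ack_from_template(unsigned long count);
    extern void deliver_up(Msg *ack);      /* runs on the caller's stack  */

    static unsigned long segments_seen;    /* per-connection in practice  */

    void sim_tcp_recv_push(Msg *segment)
    {
        msg_destroy(segment);              /* "transmitting" = consuming  */
        if (++segments_seen % 2 == 0) {    /* ack every other segment     */
            Msg *ack = build_ack_from_template(segments_seen);
            deliver_up(ack);
        }
    }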
2.5 Baseline Results

In this section we present a set of baseline results on our 8-processor 100 MHz Challenge machine. Our goal here is to illustrate the differences between the send and receive paths, and the impact of checksumming and packet size on scalability. The baseline protocol implementations that generated these results include message caching, atomic increment/decrement, and (in the case of TCP) a single lock on the TCP state. The locks used are the SGI-supplied mutex locks. In Sections 2.7 and 2.8 we describe these protocol structuring and implementation choices, and examine how they and various other alternative approaches affect and determine performance. In our experiments each processor has a single thread which is wired to that processor, similar to the method used by Bjorkman and Gunningberg. To see if wiring impacted our results, we ran several TCP and UDP experiments, send and receive side, with and without checksumming, without wiring threads to processors. The only change we observed was a small (approximately ten percent) difference on the send side for UDP above 4 processors. IRIX 5.2 schedules for cache affinity [9], and so we conclude that wiring has little effect on our experiments.
2.5.1 Send and Receive Side Processing

Figure 2.5 shows UDP send-side throughput, for a single UDP connection, in megabits per second, measured on our 8-processor Challenge machine. Figure 2.6 shows relative speedup for the send side in UDP, where speedup is normalized relative to the uniprocessor throughput (of the multiprocessor implementation) for that particular packet size. Figures 2.7 and 2.8 show UDP receive-side throughput and speedup, respectively.
[Figure 2.5: UDP Send Throughputs. Throughput (Mbits/sec) vs. number of processors (1-8), for 4K-byte, 1K-byte, and 10-byte packets, with checksumming on and off.]

[Figure 2.6: UDP Send Speedup. Relative speedup vs. number of processors for the same configurations.]

[Figure 2.7: UDP Receive Throughputs. Throughput (Mbits/sec) vs. number of processors (1-8), for 4K-byte and 1K-byte packets, with checksumming on and off.]

[Figure 2.8: UDP Receive Speedup. Relative speedup vs. number of processors for the same configurations.]
For these and all subsequent graphs, each data point is the average of 10 runs, where a run consists of measuring the steady-state throughput for 30 seconds, after an initial 30-second warmup period. In addition, we isolated our Challenge multiprocessor as much as possible by running experiments with no other user activity. All non-essential daemons were removed, and the machine did not mount or export any remote file systems. To check variance, we ran one 8-processor test 400 times, and observed that the data fit a normal bell-curve distribution. Throughput graphs include 90 percent confidence intervals. The figures show that, as Bjorkman and Gunningberg discovered [12], UDP send-side performance scales well with larger numbers of processors. In our discussion, scalability means the first derivative of speedup as the last processor is added to the experiment, i.e., the marginal speedup gained from the final processor. Note that a test can demonstrate high speedup to a point but exhibit poor scalability. We observe that send and receive side processing scale differently, but we do not wish to claim any inherent difference between their relative scalability. This is because our send-side experiments explicitly yield the processor on every packet, but the receive side relies on the operating system to preempt the thread. This is partially a historical artifact of our implementation, and we plan a more detailed comparison between the two in the future. One major difference between the send and receive paths is that a protocol's receive processing must demultiplex incoming packets to the appropriate upper-layer protocol. At first, we thought that the locks used in the map manager for demultiplexing might be creating a bottleneck, but running the test without locking the maps yielded only a small (approximately 10 percent) improvement in throughput. The TCP throughput and speedup results, again for a single connection, are given in Figures 2.9-2.12, respectively. The TCP numbers here are from our baseline TCP, TCP-1, further described in Section 2.7.1. Our results show that TCP does not scale nearly as well as UDP, in either the send or receive case. Locking state is the culprit here.
[Figure 2.9: TCP Send Throughputs. Throughput (Mbits/sec) vs. number of processors (1-8) for TCP-1, with 4K-byte and 1K-byte packets, checksumming on and off.]

[Figure 2.10: TCP Send Speedup. Relative speedup vs. number of processors for the same configurations.]

[Figure 2.11: TCP Receive Throughputs. Throughput (Mbits/sec) vs. number of processors (1-8), with 4K-byte and 1K-byte packets, checksumming on and off.]

[Figure 2.12: TCP Receive Speedup. Relative speedup vs. number of processors for the same configurations.]
For example, profiling with Pixie [114] shows that in an 8-processor receive-side test, 90 percent of the time is spent waiting to acquire the TCP connection state lock; on the send side, the figure is 85 percent. Several unusual points warrant mentioning. Figure 2.9 shows that send-side throughput appears to level off at around 215 megabits/sec. Figure 2.11 shows that receive-side throughput levels off above 350 megabits/sec, but then drops off suddenly afterwards. This dip is caused by the combination of TCP packets being misordered when threads contend for the connection state lock, and the difference in processing times for in-order versus out-of-order packets in TCP. Section 2.6.1 discusses how this problem was discovered, as well as our solution.
2.5.2 Checksumming and Packet Size

As discussed earlier, we were interested in how checksumming and packet size influence the performance of parallel protocols. Our expectation was that relative speedup would be greater when processing larger packets with checksumming, since checksumming occurs outside of locked regions and thus constant per-packet costs would constitute a smaller fraction of the processing time [65]. Figures 2.6, 2.8, 2.10, and 2.12 show that, in general, tests with larger packets have better speedup than those with smaller ones, and experiments for a particular packet size with checksumming have better speedup than those without, although the differences are not as pronounced as we had expected. The trends agree somewhat with those shown in [48], which showed better speedup with larger data units. However, their tests included presentation-layer conversion, which is much more compute-bound and data-intensive than checksumming. Although the SGI documentation gives the aggregate bus bandwidth as 1.2 gigabytes/sec, we wished to see the read-bandwidth limitations imposed by checksumming. To this end, we ran a micro-benchmark that checksummed over a large amount of data, to force cache misses. We observed that each processor could checksum at a
rate of 32 MB/sec, or 256 megabits/sec, at least up to 8 processors. Assuming the bandwidth does not degrade as processors are added, this implies that the bus could support up to 38 processors doing nothing but checksumming.
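The 38-processor estimate is simply the ratio of the two measured numbers:

$$\frac{1.2\ \text{GB/sec (aggregate bus bandwidth)}}{32\ \text{MB/sec per checksumming processor}} \approx 37.5 \approx 38\ \text{processors}.$$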
2.6 Ordering Issues

2.6.1 Ordering Issues in TCP

As evidenced in Figure 2.11, receive-side TCP throughput falls drastically beyond 4 or 5 processors. Further investigation showed large numbers of out-of-order arrivals at the TCP layer, a surprising result since data was being generated in-order by the simulated TCP sender. As the TCP header prediction algorithm is dependent on the arrival of in-order packets, we hypothesized that out-of-order arrivals were reducing performance. To test this hypothesis, we ran a test using a version of TCP modified to treat every packet as if it were in-order, regardless of whether it actually was in-order. The result was the disappearance of the anomaly. The question then became how to bridge the gap between the observed behavior and the forced in-order experiment. The Pixie results showed high contention for the connection state lock, and since the raw mutex locks provided by IRIX are not FIFO, this suggested that lock contention was causing threads, and thus packets, to be reordered. To preserve the original ordering, we implemented FIFO queuing using the MCS locks of Mellor-Crummey and Scott [79]. Their locking algorithm requires atomic swap and compare-and-swap functions, which we implemented using short R4000 assembler routines. Figure 2.13 illustrates the effects, using 4 KB packets with checksumming on. The top curve in the figure is from the modified TCP where packets are assumed to be in order, a potential upper bound. The bottom curve is the baseline TCP-1 implementation using regular mutex locks for the connection state. The middle curve is TCP-1 using MCS locks. We see that using these locks bridges the majority of the gap between the baseline case and the "upper bound."
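For reference, the structure of an MCS-style queue lock is sketched below. The GCC/C11 atomic builtins stand in for our short R4000 assembler routines (atomic swap and compare-and-swap); the essential property is that waiters form a queue, each thread spins only on its own node, and the lock is handed off in arrival (FIFO) order.

    /* MCS queue-lock sketch (illustrative).  __atomic_* builtins stand in
       for the LL/SC-based atomic swap and compare-and-swap routines. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_qnode {
        struct mcs_qnode *next;
        volatile bool     locked;
    } mcs_qnode;

    typedef struct { mcs_qnode *tail; } mcs_lock;   /* tail == NULL: free */

    void mcs_acquire(mcs_lock *L, mcs_qnode *me)
    {
        me->next   = NULL;
        me->locked = true;
        /* atomic swap: enqueue ourselves at the tail */
        mcs_qnode *pred = __atomic_exchange_n(&L->tail, me, __ATOMIC_ACQ_REL);
        if (pred != NULL) {                  /* lock held: link in, spin  */
            pred->next = me;
            while (me->locked)
                ;                            /* spin on our own node      */
        }
    }

    void mcs_release(mcs_lock *L, mcs_qnode *me)
    {
        if (me->next == NULL) {              /* no known successor        */
            mcs_qnode *expected = me;        /* compare-and-swap the tail */
            if (__atomic_compare_exchange_n(&L->tail, &expected, NULL,
                    false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                return;                      /* nobody was waiting        */
            while (me->next == NULL)
                ;                            /* successor still linking   */
        }
        me->next->locked = false;            /* FIFO hand-off             */
    }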
[Figure 2.13: TCP Ordering Effects (Checksum On). Receive-side throughput (Mbits/sec) vs. number of processors (1-8) for TCP-1 with packets assumed in-order, TCP-1 with MCS locks, and TCP-1 with mutex locks.]

[Figure 2.14: TCP Ordering Effects (Checksum Off). The same three configurations with checksumming disabled.]
Table 2.1 Percentage of packets out-of-order
Processors     1   2   3   4   5   6   7   8
Mutex Locks   00  02  04  05  11  25  42  54
MCS Locks     00  02  04  06  09  11  14  18

Figure 2.14 shows the throughputs for the experiment with checksumming disabled. We see that in this case there is no statistically discernible difference between the performance of the "upper bound" TCP and TCP with MCS locks. Closing the remainder of the gap with checksumming is not trivial. For example, we tried a receive-side test where map lookups were serialized by MCS locks, and observed a slight reduction in throughput. Since MCS locks have a greater fixed overhead without contention than the straight mutex locks (1.5 µsec vs. 0.7 µsec), we did not wish simply to replace all mutex locks in the system with MCS locks. However, as observed above, in the right scenario they can create a significant performance improvement. The eight-processor throughputs in Figure 2.13 are slightly less than the seven-processor throughputs, leading to the speculation that throughputs might continue to degrade for larger numbers of processors. We investigate this hypothesis in Section 2.10. Table 2.1 also shows the impact of using FIFO locks. The table gives the percentage of packets received out-of-order in TCP with mutex locks and MCS locks, for a receive-side test using 4KB packets with checksumming. The table shows a large difference in the number of out-of-order packets between the two locking schemes as the number of processors increases. An interesting side issue is the misordering that can occur on the send side when threads pass each other below TCP but before reaching the FDDI driver. This would cause packets to be placed out-of-order on the wire, and probably arrive out-of-order
at the receiver. To quantify this potential problem, we measured the percentage of out-of-order packets in the send-side driver, and observed that fewer than one percent were misordered with up to eight processors.
2.6.2 Ordering and Correctness

We note that preserving order is a semantic correctness issue. If an application uses TCP and cannot cope with out-of-order delivery, packet order must also be preserved above TCP. When parallelism is introduced, an arriving packet cannot simply release the TCP connection state lock and continue; the moment the lock is relinquished, guarantees of ordered data above TCP are lost. Similarly, on the send side, the order of the data the application passes to TCP must be preserved, lest the byte order TCP preserves be different from the one the application supplied. For some applications, this is not a problem. For example, NFS does not assume ordered packets (although it does require bytes within a datagram to be ordered), and can be configured to use TCP. In most cases, however, the application requires order to be preserved. To force in-order delivery up through the application, we added a ticketing scheme, similar to a bakery algorithm, to our TCP. Before releasing the TCP connection state lock, a receiving thread acquires an up-ticket for the next higher layer. The thread then releases the connection state lock, and continues up the stack. In the test application above TCP, at the point where the application requires order, the thread can then wait for its ticket to be called. The amount of mechanism required to implement this feature is not large, but enforcing the order further limits performance.
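A sketch of the mechanism is shown below; the names are hypothetical, not the actual code in our modified TCP and test application. The key point is that tickets are issued while the TCP connection state lock is still held, so they reflect the order in which TCP accepted the packets.

    /* Bakery-style ticketing sketch (illustrative; names are hypothetical). */
    typedef struct {
        volatile unsigned next_ticket;   /* next ticket to hand out      */
        volatile unsigned now_serving;   /* ticket allowed to proceed    */
    } ticket_gate;

    /* Called by a receiving thread while it still holds the TCP
       connection state lock, so tickets are issued in arrival order. */
    unsigned ticket_take(ticket_gate *g)
    {
        return g->next_ticket++;
    }

    /* Called by the application thread at the point where it requires
       in-order data; it waits until its ticket is called. */
    void ticket_wait(ticket_gate *g, unsigned my_ticket)
    {
        while (g->now_serving != my_ticket)
            ;                            /* spin until it is our turn    */
    }

    void ticket_done(ticket_gate *g)
    {
        g->now_serving++;                /* admit the next thread        */
    }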
[Figure 2.15: Ticketing Effects in TCP. Receive-side throughput (Mbits/sec) vs. number of processors (1-8) for 4KB packets, with and without ticketing, checksumming on and off.]

Figure 2.15 shows a receive-side TCP throughput test using 4KB packets, comparing an application that requires order preservation versus one that does not. In this example, the application is our test code, which simply counts packets that arrive. The application's critical section itself is small, a lock-increment-unlock sequence; the performance is lost preserving the order. We are not the first to observe this problem [48, 57], but to our knowledge, previous work has not provided adequate solutions. For example, in [48], Goldberg et al. use a ticketing scheme similar to ours, but assign tickets to packets at the driver for use in re-ordering at the application. However, this assumes a one-to-one correspondence between arriving packets and application data units. It does not address issues such as corrupted packets that are dropped, fragmented packets that are reassembled, or packets that contain no data at all, such as acknowledgements. The more general problem is to provide a mechanism that is correct in a general fashion, across several protocol layers. The solution we describe above only solves the problem when there is a one-to-one correspondence between a TCP connection and the application's notion of a connection. This is the case in the example of TCP and BSD sockets. However, if a TCP connection were multiplexed by several other higher-layer protocols, each message would have to be
"re-ticketed" at each multiplexing or demultiplexing point.[1] A general solution that meshes with the x-kernel's infrastructure is an issue still under study.

[1] Thanks to Mats Bjorkman for pointing this out.

2.6.3 Multiple Connections

Given the performance penalty exacted for maintaining order, and the single-connection performance limits in TCP, we argue that if parallel applications are to reap the benefits of parallelized networking, they should perform their own ordering. Using either a connectionless protocol such as UDP or a connection-oriented protocol such as TCP with multiple connections, an application must be able to handle out-of-order delivery. Lindgren et al. [74] make a related argument that the parallel application must be tied closely to the parallel communication system. To illustrate the benefits of using multiple connections, we ran send-side and receive-side experiments of TCP-1 with MCS locks, without ticketing, using 4KB packets with and without checksumming. In these tests, each processor was responsible for a separate connection. For example, the eight-processor experiment examines throughput for eight connections. The simulated drivers were modified slightly to support multiple connections for these tests. The results are shown in Figure 2.16. The graph shows steadily increasing throughput as connections (and their associated processors) are added. This test is somewhat "idealized" in that the distribution of traffic across connections is uniform: all connections send data as fast as they can. However, the point of the experiment is to show that the connection state lock is the major bottleneck for a single connection, and that it may be overcome by using multiple connections.
[Figure 2.16: TCP with Multiple Connections. Aggregate throughput (Mbits/sec) vs. number of processors/connections (1-8), send and receive side, with checksumming on and off.]
2.7 Locking Issues

2.7.1 Locking Granularity in TCP

Recall that TCP maintains a relatively large amount of state per connection. As we discussed earlier, a question we wished to address was how that state should be locked in order to maximize performance and speedup. Towards this end, we produced three versions of our TCP, each with a different number of locks. For illustrative purposes, we call them TCP-N, where N indicates the number of locks involving connection state. The first version, the baseline given in Section 2.5, is TCP-1, which uses only a single lock to protect all connection information. The second version, TCP-2, uses two locks per connection: one to protect send-side state, and the other to protect receive-side state. The last version, TCP-6, uses the locking style from the SICS MP TCP, with six locks serializing access to various components of the connection state.
More specifically, TCP-6 has separate locks to protect the receive-side re-assembly queue, the send-side retransmission buffer, the header prepend operation, the header remove operation, the send-side window state, and the receive-side window state. In most cases this locking is either redundant or unnecessary. For example, header manipulation occurs solely on the stack of the calling thread; thus, no locking is necessary. Similarly, the send and receive queues need to be locked at the same time as the send and receive window state, which is redundant. Another concern we had with the SICS TCP implementation was that locks were being held where checksum calculation would have been done, on both incoming and outgoing packets.[2] In the x-kernel, this occurs where headers are prepended or removed, respectively, and the TCP-6 code is consistent with their implementation. However, we saw that locking was not necessary here, and our two other TCP implementations reflect this. The key realization is that checksumming a packet is orthogonal to manipulating connection state. The only change needed was, in the case of the outbound processing in tcp_output, that the checksum calculation had to be moved so that it was done outside the scope of the send window lock. This did not affect correctness. The results for the three TCP implementations are given in Figures 2.17 and 2.18, which plot send- and receive-side throughput, respectively, with checksumming. The three TCPs measured here are based on the baseline version described in Section 2.5 with the addition of MCS locks. The goal here is simply to compare locking strategies. TCP-1 and TCP-2 both outperform TCP-6, particularly when checksumming is enabled. With checksumming off, the gaps are smaller, but the relative ordering between the three TCPs is the same. In all cases, send and receive side, with and without checksumming, the code with the simplest locking, TCP-1, performed the best. We also observed this behavior when the three TCPs did not include MCS locks.
[2] We note that Bjorkman and Gunningberg reported results for TCP without checksumming.
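The change to tcp_output can be sketched as follows; aside from tcp_output itself, the types and helper names below are illustrative stand-ins rather than the actual x-kernel or Net/2 identifiers.

    #include <stdint.h>
    #include <pthread.h>

    /* Illustrative sketch: the connection-state lock covers header
       construction and send-window updates, while checksumming the
       (now fully built) segment happens after the lock is released. */
    struct segment { uint16_t th_sum;            /* ... header and data ... */ };
    struct tcpcb   { pthread_mutex_t state_lock; /* ... connection state ... */ };

    extern void     build_tcp_header(struct tcpcb *, struct segment *);
    extern void     update_send_window(struct tcpcb *, struct segment *);
    extern uint16_t in_cksum(const struct segment *);
    extern void     ip_push(struct segment *);

    void tcp_output_sketch(struct tcpcb *tp, struct segment *seg)
    {
        pthread_mutex_lock(&tp->state_lock);
        build_tcp_header(tp, seg);       /* fill sequence/ack/window fields   */
        update_send_window(tp, seg);     /* advance send state, queue rexmit  */
        pthread_mutex_unlock(&tp->state_lock);

        seg->th_sum = in_cksum(seg);     /* orthogonal to connection state    */
        ip_push(seg);                    /* hand the segment to IP            */
    }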
[Figure 2.17: TCP Send-Side Locking Comparison. Throughput (Mbits/sec) vs. number of processors (1-8) for TCP-1, TCP-2, and TCP-6 with 4KB and 1KB packets.]

[Figure 2.18: TCP Receive-Side Locking Comparison. The same configurations on the receive side.]
In retrospect, we can see that the single-lock version would indeed perform the best, since the Net/2 TCP implementation manipulates send-side state on the receive path, and receive-side state on the send path. For example, in the TCP header prediction code (intended to be common-case processing), both the send and receive state locks must be acquired. Another attractive feature of using a single lock is its simplicity. Implementation is easier, deadlock is easier to avoid, and atomicity of changes to protocol state is easier to guarantee. We note that this result is specific to the BSD implementation, and that a TCP implementation designed around separating send- and receive-side processing may well yield better speedup with multiple locks. However, due to the widespread use of the BSD code, our Net/2 results are applicable to many operating systems.
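The header-prediction point can be made concrete with a sketch. The field names below follow the Net/2 tcpcb loosely (snd_una, snd_wnd, rcv_nxt), but the function is hypothetical and only illustrates why a send/receive lock split such as TCP-2 buys little: the common receive path touches send-side state on pure ACKs and receive-side state on in-sequence data, so both locks end up being taken anyway.

    #include <pthread.h>

    /* Hypothetical sketch of the locking pattern in receive-side header
       prediction under a two-lock (TCP-2 style) split. */
    struct tcpcb {
        pthread_mutex_t snd_lock, rcv_lock;
        unsigned long   snd_una, snd_wnd, rcv_nxt;
    };

    void tcp_input_predicted(struct tcpcb *tp, unsigned long ack,
                             unsigned long seq, unsigned len)
    {
        pthread_mutex_lock(&tp->snd_lock);
        pthread_mutex_lock(&tp->rcv_lock);
        if (len == 0 && ack > tp->snd_una)
            tp->snd_una = ack;           /* pure ACK: send-side state      */
        else if (seq == tp->rcv_nxt)
            tp->rcv_nxt += len;          /* in-sequence data: receive side */
        pthread_mutex_unlock(&tp->rcv_lock);
        pthread_mutex_unlock(&tp->snd_lock);
    }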
2.7.2 Atomic Increment and Decrement

Another locking issue we examined was using atomic increment and decrement functions that exploit the R4000's load-linked (LL) and store-conditional (SC) instructions. LL and SC allow programmers to produce lock-free primitives [52]. A simple example of this is atomic increment, which replaces a lock-increment-unlock sequence. We tried this for two reasons. First, the x-kernel's message tool relies on the notion that reference counts are atomically manipulated, and so these primitives map directly onto the existing code. Thus, the primitives benefit the message tool, and subsequently all protocols that use it. Second, the x-kernel uses reference counts on session and protocol state in order to know when objects can be freed. When a packet is demultiplexed, these reference counts are incremented on the way up the stack and then decremented on the way down. This means that two locks are acquired and released per layer, on the fast path of data transfer.
[Figure 2.19: TCP Atomic Operations Impact. Throughput (Mbits/sec) vs. number of processors (1-8), send and receive side, with and without atomic operations.]
Thus, fast atomic primitives again potentially benefit the entire protocol stack. Replacing a lock-increment-unlock sequence with an atomic increment pays off in two ways. First, a layer of procedure call is removed, which can affect performance on the fast path. Second, in the best case, memory traffic is reduced by replacing three writes with a single one. We implemented these primitives with short R4000 assembler routines. Sample results are given in Figure 2.19, which shows the effects of atomic primitives on TCP throughputs with 4KB packets and checksumming on. Both TCP and UDP see improvements with the atomic primitives. The UDP receive side obtains a larger benefit than the send side from atomic increments, due to the reference-count manipulation that happens during demultiplexing. The benefits to the TCP send and receive sides were approximately equal, as the majority of the improvement is due to a more efficient message tool.
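The primitive itself is tiny. The sketch below shows the retry structure; on the R4000 the loop body is literally a load-linked, an add, and a store-conditional that fails if another processor touched the word, while here a GCC compare-and-swap builtin stands in for the assembler routine.

    /* Lock-free atomic increment sketch, replacing
       lock(l); (*p)++; unlock(l);  for message reference counts. */
    static inline void atomic_inc(int *p)
    {
        int old = __atomic_load_n(p, __ATOMIC_RELAXED);           /* "LL" */
        while (!__atomic_compare_exchange_n(p, &old, old + 1,     /* "SC" */
                    1 /* weak */, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
            ;   /* store-conditional failed: old was reloaded, retry */
    }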
[Figure 2.20: TCP Message Caching Impact. Throughput (Mbits/sec) vs. number of processors (1-8), send and receive side, with and without message caching.]
2.8 Per-Processor Resource Caching

As mentioned earlier, x-kernel protocols make heavy use of the message tool to manipulate packets. Since caching has been shown to be effective in data structure manipulations [3, 39], we decided to evaluate the use of simple per-thread resource caches in the message tool. Whenever a thread requires a new MNode (the message tool's internal data representation), it first checks a local cache, which can be done without locking. The cache is managed last-in first-out (LIFO) to maximize cache affinity. This avoids contention in two ways: first, the lock in malloc serializing memory allocation is avoided, reducing locking contention and possible system calls (e.g., sbrk). Second, memory freed by a processor is re-used by that processor, avoiding memory contention. Figure 2.20 gives a sample of the results, displaying TCP throughputs with 4 KB packets with checksumming. The improvement in TCP is significant, due to its heavy use of the message tool. The results are also positive for the UDP send and receive sides.
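A per-thread cache of this kind can be sketched as follows; the names are illustrative, not the actual x-kernel message-tool code. Because each thread is wired to a processor, no lock is needed on the cache, and nodes freed on a processor are re-used there.

    /* Per-thread LIFO resource cache sketch for MNodes (illustrative). */
    #define MNODE_CACHE_MAX 64

    typedef struct mnode { struct mnode *next; /* ... message data ... */ } MNode;

    typedef struct {
        MNode *top;              /* LIFO free list: most recently freed first */
        int    count;
    } MNodeCache;

    extern MNodeCache *my_cache(void);        /* this thread's private cache  */
    extern MNode      *mnode_malloc(void);    /* falls back to locked malloc  */
    extern void        mnode_free_global(MNode *);

    MNode *mnode_alloc(void)
    {
        MNodeCache *c = my_cache();
        if (c->top != NULL) {                 /* hit: no lock, likely warm    */
            MNode *n = c->top;
            c->top = n->next;
            c->count--;
            return n;
        }
        return mnode_malloc();                /* miss: global allocator       */
    }

    void mnode_release(MNode *n)
    {
        MNodeCache *c = my_cache();
        if (c->count < MNODE_CACHE_MAX) {
            n->next = c->top;                 /* push: LIFO preserves affinity */
            c->top  = n;
            c->count++;
        } else {
            mnode_free_global(n);             /* cache full: hand back         */
        }
    }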
2.9 Architectural Trends

Of the experiments given in Section 2.5 that coincide with those given by Bjorkman and Gunningberg on the Sequent, we observed similar trends but relatively lower speedups. Drawing conclusions based on comparing speedups between two largely different architectures would most likely be inappropriate. Still, we were curious as to the differences that hardware made, since the Sequent used in their experiments was an older machine. Although we could not compare our results with theirs directly, we thought it would be illustrative to run our code on older hardware. To this end, we ran the same experiments on a Power Series, the previous-generation Silicon Graphics multiprocessor, with four 33 MHz MIPS R3000's. We also ran our experiments on a faster version of our machine, a four-processor Challenge using 150 MHz R4400's. In all cases, the machines ran version 5.2 of the IRIX operating system. In these additional tests, we did not have exclusive access to the other machines, and so were not able to isolate them as carefully as with our Challenge experiments. However, we did run all our tests with minimal other activity on the systems, and the width of the confidence intervals on these graphs shows that variance is low. Examples of the architectural comparisons are given in Figures 2.21 - 2.28, which show TCP and UDP send-side and receive-side throughputs and speedups on the three platforms. In general, our findings were consistent across platforms. We do not wish to draw broad conclusions, especially from machines with only four processors, but we can summarize our observations:
- On all platforms, TCP-1 outperformed TCP-2 and TCP-6.

- On all platforms, UDP send-side scaled well, and TCP scaled poorly.

- On all platforms, the fastest machine had the highest throughput for a particular test.

- Speedup was consistently best on the Power Series (the oldest machine) and about the same on the two Challenge platforms.

- The two Challenge machines exhibited the receive-side drop in throughput at 2 processors, but the Power Series did not. In particular, UDP receive-side performance scales on the Power Series as far as could be observed, namely up to four processors.

[Figure 2.21: TCP Send Throughputs. Throughput (Mbits/sec) vs. number of processors (1-4) on the 150 MHz R4400 Challenge, the 100 MHz R4400 Challenge, and the 33 MHz R3000 Power Series, with checksumming on and off.]

[Figure 2.22: TCP Send Speedups. Speedup vs. number of processors for the same platforms and configurations.]

[Figure 2.23: TCP Receive Throughputs. Same platforms and configurations, receive side.]

[Figure 2.24: TCP Receive Speedups.]

[Figure 2.25: UDP Send Throughputs. Same platforms and configurations for UDP.]

[Figure 2.26: UDP Send Speedups.]

[Figure 2.27: UDP Receive Throughputs.]

[Figure 2.28: UDP Receive Speedups.]

The last item is perhaps the most interesting. Without more detailed information, we cannot assert any explanations for the behavior. We do note, though, that the Power Series performs locking using a separate dedicated synchronization bus, similar to the Sequent. The Challenge, however, uses memory to synchronize, relying on the coherency protocol and the load-linked/store-conditional instructions [44]. Given that Bjorkman and Gunningberg did not observe the receive-side drop for their UDP receive-side tests on the Sequent, we suspect that the difference in synchronization may be the cause of the anomaly. We are pursuing further studies along these dimensions. Finally, the 100 MHz Challenge uniprocessor throughputs are roughly 25 to 50 percent better than those of the 33 MHz Power Series. This improvement is surprisingly small, given that the former has a clock three times faster, on-chip caches, and larger secondary caches. This is only one architectural comparison, with different generations of both the MIPS architecture and multiprocessor interconnects. Still, it suggests that network protocol processing speed may not be improving as fast as application performance, which agrees with the operating system trends shown in [4, 92].
2.10 20-Processor Results

We were able, courtesy of SGI, to obtain brief access to a 20-processor version of our SGI Challenge that had 150 MHz MIPS R4400 processors.
[Figure 2.29: TCP Send Throughputs. Single-connection throughput (Mbits/sec) vs. number of processors (up to 20), for 4K-byte and 1K-byte packets, with checksumming on and off.]

[Figure 2.30: TCP Receive Throughputs. The corresponding receive-side results.]
Although we did not have time to run all of our experiments, we were able to test some of our results on this platform. Given that this machine had 150 MHz processors, the throughputs should not be directly compared to our 100 MHz 8-processor throughputs. Instead, we use this machine to see how our results scale up to 20 processors. Figures 2.29 and 2.30 show the 20-processor throughputs for the single-connection TCP send and receive sides, respectively. We see that TCP still exhibits limited single-connection parallelism. Recall that, in Figure 2.13, the 8-processor throughputs were less than the corresponding 7-processor throughputs, suggesting that performance might fall rapidly as more processors are added. Figures 2.29 and 2.30 show that performance instead degrades gracefully as more processors are used. Figures 2.31 and 2.32 present the 20-processor results for the single-connection UDP send and receive sides, respectively. Here we see that throughput increases until about 10 processors are used, and then falls afterwards, implying that a bottleneck has been reached. Unfortunately, we did not have the 20-processor machine for sufficient time to determine the bottleneck. Figure 2.33 shows the locking comparison of Figure 2.17 on our 20-processor machine. We see that the single-lock TCP implementation still performs best.
[Figure 2.31: UDP Send Throughputs. Single-connection throughput (Mbits/sec) vs. number of processors (up to 20), for 4K-byte and 1K-byte packets, with checksumming on and off.]

[Figure 2.32: UDP Receive Throughputs. The corresponding receive-side results.]

[Figure 2.33: TCP Locking Comparison. Throughput (Mbits/sec) vs. number of processors (up to 20) for TCP-1, TCP-2, and TCP-6 with 4KB packets.]

[Figure 2.34: TCP Multiple Connections. Aggregate throughput (Mbits/sec) vs. number of processors/connections (up to 20), send and receive side, with checksumming on and off.]

Figure 2.34 shows the multiple-connection experiment of Figure 2.16 using up to 20 processors. We see that using multiple connections still results in greater performance, at least up to 15 processors.
2.11 Conclusions

We briefly summarize our findings as follows:
- Preserving order pays. We showed that, in cases where contention for locks perturbs order, simple FIFO queuing locks preserve this order, which improves performance.

- Single-connection TCP parallelism is limited, both on the receive side and on the send side, even more so than shown by Bjorkman and Gunningberg.

- Multiple-connection TCP parallelism can scale, since contention for the connection state lock is avoided. However, the application must manage order across connections.

- Exploiting cache affinity, and avoiding contention, is significant. This is demonstrated by the effectiveness of per-processor resource caching. Contemporary machines are memory-bound, due to the disparity between CPU and memory speeds, and the gap is only expected to grow. We explore this issue in more detail in Chapter 4.

- Simpler locking is better. We showed that, on a modern machine, locking structure impacts performance, and that a complex protocol with large connection state yields better speedup with a single lock than with multiple locks.

- Atomic primitives can make a big difference. Replacing sequences of lock-increment-unlock with an atomic increment improved receive-side TCP and UDP performance by about 20 percent on average, and send-side performance by between 5 and 10 percent.

- Speedup is influenced by the use of checksumming. This was demonstrated by experiments with checksumming exhibiting larger speedup than those without.

- Packet size influences speedup. This was demonstrated by experiments with larger packet sizes generally exhibiting larger speedup, particularly those using checksumming.

These results indicate that packet-level parallelism is especially beneficial for connectionless protocols, but that connection-oriented protocols will have limited benefits in speedup within a single connection. Applications will need to use multiple connections to obtain parallel performance with connection-oriented protocols. This implies that they will also have to manage order between connections.
CHAPTER 3

SUPPORT FOR SECURE SERVERS

Security and privacy are growing concerns in the Internet community, because of the Internet's rapid growth and the desire to conduct business over it safely. This desire has led to the advent of several proposals for security standards, such as secure IP [6], Secure HTTP (SHTTP) [101], and the Secure Socket Layer (SSL) [53]. Thus, the need to use encryption protocols such as DES and RSA is increasing.
3.1 Introduction

One problem with using cryptographic protocols is the fact that they are slow. An important question then is whether security can be provided at gigabit speeds. The standard set of algorithms required to secure a connection includes a bulk encryption algorithm such as DES [2], a cryptographic message digest such as MD5 [102], a key exchange algorithm such as Diffie-Hellman [34] to distribute a private key securely, and some form of digital signature algorithm to authenticate the parties (e.g., RSA [103]). The encryption and hash digest algorithms must be applied to every packet going across a link to make it secure, and therefore the performance of these algorithms directly affects the achievable throughput of an application. Furthermore, there are some services, such as strong sender authentication in multicast, that require the use of expensive algorithms such as RSA signatures on every packet [86]. For non-multicast services, the most costly algorithms, RSA and Diffie-Hellman, need only be run at connection set-up time. They only affect the overall bandwidth if the
connections are short, the algorithms are particularly expensive, and/or the overhead of using these algorithms on busy servers reduces overall network performance. A straightforward approach to improving cryptographic performance is to implement cryptographic algorithms in hardware. This approach has been shown to improve cryptographic performance of single algorithms (e.g., DEC has demonstrated a 1 Gbit/sec DES chip [40]). Unfortunately there are several problems with this approach, which indicate why one would desire to do cryptography in software:
- Variability. Systems need to support suites of cryptographic protocols, not just DES. For example, both SSL and SHTTP allow the use of RSA, DES, Triple-DES, MD5, RC4, and IDEA. Hardware support for all of these algorithms is unlikely.

- Flexibility. Standards change over time, and this is one reason why the IP security architecture and proposed key exchange schemes are designed to handle multiple types of authentication and confidentiality algorithms.

- Security. Algorithms can be broken, such as knapsack crypto-systems, 129-digit RSA, and 192-bit DHKX. It is easier to replace software implementations of cryptographic algorithms than hardware implementations.

- Cost. Custom hardware is not cheap, ubiquitous, or exportable.

- Performance. Certain algorithms, such as MD5 and IDEA, are designed to run quickly in software on current microprocessor architectures. Specialized hardware for these algorithms may not improve performance much. Touch [118] gives an analysis of this for MD5.

Many approaches are available for improving software-based cryptographic support [85], including improved algorithm design and algorithm-independent hardware
support. In this chapter, we demonstrate how parallelism can improve software cryptographic performance for secure servers. We show that parallelism is an effective vehicle for this task, demonstrating linear speedup for several different Internet-based cryptographic protocol stacks using two different approaches to parallelism. Our approaches consist of parallelized implementations of the x-kernel [58] extended for packet-level [87] and connection-level [123] parallelism. We present results on a 12-processor 100 MHz MIPS R4400 SGI Challenge.
3.2 Survey of Related Work

Much of the related work has been described above in Section 3.1 and in Chapter 2 in Sections 2.1 and 2.2. Here we discuss research relating to connection-level parallelism and cryptographic protocols. The most comprehensive study to date comparing different approaches to parallelism on a shared-memory multiprocessor is by Schmidt and Suda [111, 112]. They show that packet-level parallelism and connection-level parallelism generally perform better than layer parallelism, due to the context-switching overhead incurred crossing protocol boundaries using layer parallelism. In [112], they suggest that packet-level parallelism is preferable when the workload is a relatively small number of active connections, and that connection-level parallelism is preferable for large numbers of connections. In another piece of work, Yates et al. [123] examine connection-level parallelism (CLP) in depth, using an implementation of the x-kernel extended to support CLP, and show performance results on a 20-processor SGI Challenge. They find that in connection-level parallelism, throughput scales nearly linearly as the number of processors and connections are increased. They observe that throughput is mostly sustained as larger and larger numbers of connections are used. They find that the
best aggregate performance is achieved by matching the number of threads in the system to the number of connections (in thread-per-connection), but that the best fairness results from matching the number of threads to the number of processors (processor per connection).
3.3 Research Issues

The research issues we address in this section are the following:
- Is parallelism an appropriate vehicle for improving cryptographic protocol performance?

- How do the throughputs and speedups of various cryptographic protocols (e.g., DES, 3-DES, MD5) compare with one another?

- How do the throughputs and speedups of cryptographic protocol stacks compare to stacks without security protocols?

- Both packet-level parallelism and connection-level parallelism have advantages and disadvantages when used without secure protocols. How does adding cryptographic protocol processing change these advantages and disadvantages?

In this chapter, we examine how parallelism can be used to improve software encryption protocol performance. We show this by demonstrating speedup of several different Internet-based cryptographic protocol stacks using both packet-level parallelism and connection-level parallelism.
3.4 Protocols Used

The protocols used in our experiments are those that would be used in typical secure Internet scenarios, as seen along the common case or "fast path" during data transfer of an application such as a secure World-Wide-Web (WWW) server.
We focus on available throughput; we do not examine connection setup or teardown, or the attendant issues of key exchange. In these experiments, connections are assumed to have been established, and keys are assumed to be available as needed. We use the terminology defined by the secure IP standard [6]. Authentication is the property of knowing that the data received is the same as the data sent by the sender, and that the claimed sender is in fact the actual sender. Integrity is the property that the data is transmitted from source to destination without undetected alteration. Confidentiality is the property that the intended participants know what data was sent, but any unintended parties do not. Encryption is typically used to provide confidentiality. The baseline protocol stack is an Internet-based TCP/IP/FDDI stack as discussed in Chapter 2. In addition, we examine stacks with various cryptographic protocols added at different layers in the hierarchy. MD5 is a message digest algorithm that computes a cryptographic checksum over the body of a message, and is used for authentication and message integrity. MD5 is a "required option" for secure IP, where required option means that an application's use of MD5 is optional, but that a compliant secure IP implementation is required to make the option available. MD5 is also the default message digest algorithm proposed for SHTTP, and is also used in SSL. In our implementation we use the standard MD5 message digest calculation, rather than the keyed one. DES is the ANSI Data Encryption Standard, used for confidentiality, and is one of the required protocols used in secure IP, SHTTP, and SSL. We use DES in cipher-block-chaining (CBC) mode. 3-DES is "triple DES," which runs the DES algorithm 3 times over the message. 3-DES is an option in both SHTTP and SSL. Our protocols are taken from the cryptographic suite available with the x-kernel [91].
3.5 Parallel Infrastructure

The packet-level parallel protocol implementation is described in Chapter 2. Here we outline the implementation for connection-level parallelism; more details are available in [123]. In the connection-level parallel testbed, threads are assigned to connections on a one-to-one basis, called thread-per-connection. The entire connection forms the unit of concurrency, allowing only one thread to perform protocol processing for any particular connection. Arriving packets are demultiplexed to the appropriate thread via a packet-filter mechanism [82]. The notion of a connection is extended through the entire protocol stack. Where possible, data structures are replicated per-thread, in order to avoid locking and contention. The two schemes do have some implementation differences that are necessitated by their respective approaches to concurrency. However, wherever possible, we have made the implementations consistent. The one major change this entailed was updating the packet-level parallel TCP to include some of the BSD 4.4 fixes, but none of the RFC 1323 extensions [17].
3.6 Packet-Level Parallel Results

Figure 3.1 shows the sender's throughput rate for our packet-level parallel (PLP) experiments. Throughputs are presented in megabits per second for several protocol stacks. Our baseline Internet stack consists of TCP/IP/FDDI, representing protocol processing without any security. A second configuration is an Internet stack with MD5 between TCP and IP, representing the work done for an application that requires authentication and integrity but no confidentiality. Our third stack uses DES above TCP and MD5 below TCP, which supports both confidentiality and integrity.
[Figure 3.1: PLP Send Throughputs. Throughput (Mbits/sec) vs. number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/IP stacks.]

[Figure 3.2: PLP Send Speedup. Relative speedup vs. number of processors for the same stacks, with ideal linear speedup shown for comparison.]
We also use a stack with Triple-DES instead of DES. The throughputs are for a single TCP connection using 4 KB packets, measured on our 12-processor Challenge machine. Figure 3.2 shows the corresponding relative speedup for the stacks, where speedup is normalized relative to the uniprocessor throughput for the appropriate stack. Again, each data point is the average of 10 runs, and throughput graphs include 90 percent confidence intervals. Figure 3.1 quantifies the slowdown due to the use of cryptographic protocols. The baseline speed for the send-side TCP stack is roughly 138 Mbits/sec. Adding MD5 to the stack reduces throughput by nearly an order of magnitude, to a mere 18 Mbits/sec.[1] Adding DES on top of TCP reduces throughput by nearly two orders of magnitude, to 4.6 Mbits/sec. Using Triple-DES is 3 times slower still, at 1.5 Mbits/sec. Figure 3.2 shows the speedup for the send-side tests. The theoretical ideal linear speedup is included for comparison.

[1] MD5 runs 30-50% slower on big-endian hosts, such as our Challenge, than on little-endian hosts [118].
[Figure 3.3: PLP Receive Throughputs. Throughput (Mbits/sec) vs. number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/IP stacks.]

[Figure 3.4: PLP Receive Speedup. Relative speedup vs. number of processors for the same stacks, with ideal linear speedup shown for comparison.]
Our earlier work in Chapter 2 has shown limited performance gains when using packet-level parallelism for a single TCP connection, barring any other protocol processing, and this is reflected by the baseline TCP/IP stack's minimal speedup. This is because the time spent manipulating a TCP connection's state is large relative to the IP and FDDI processing and must occur inside a single locked, serial component. We saw that by using multiple connections, throughput can be improved. However, as more compute-intensive cryptographic protocols are used, throughput goes down but relative speedup improves. For example, the MD5 stack achieves a speedup of 8 with 12 processors, and the DES and Triple-DES stacks produce very close to linear speedup. This is because the cost of cryptographic protocol processing, which outweighs the cost of TCP processing, occurs outside the scope of any locks. Figures 3.3 and 3.4 show the throughput and speedup, respectively, for the same stacks on the receive side. Again we observe successively lower throughputs as more compute-intensive cryptographic protocols are used, and again, the more compute-intensive the stack, the better the speedup. We note that the speedup curve for 3-DES, which is the most compute-intensive, is essentially linear.
Table 3.1 PLP Latency Breakdown (µsec)
Protocol Stack    Send Side  Inc. Cost  Recv Side  Inc. Cost
TCP/IP                  236                    189
TCP/MD5/IP             1737       1500       1708       1519
DES/TCP/IP             7064       6833       7515       7326
DES/TCP/MD5/IP         8982       7846       9179       8990
3-DES/TCP/IP          21461      21225      21121      20932

Table 3.1 gives the latency breakdown for the various protocol stacks. Latency here is calculated from the uniprocessor throughput, to gain an understanding of the relative cost of adding cryptographic protocols. In our experiments, we define latency as the total time the network protocol code requires to process a packet. On the send side, latency is the total time between when a packet is sent at the top of the protocol stack and when the send function returns; it thus includes the procedure-call return time after a packet is delivered to a device. Receive-side latency is measured similarly, only the time starts at the bottom of the protocol stack and stops when the procedure has returned from delivering the packet to the top of the stack. The table includes the total latency and the incremental overhead of adding a protocol on top of the baseline TCP/IP processing time. The incremental cost for MD5 is about 1500 µsec, for DES about 7000 µsec, and for 3-DES about 21000 µsec. Since these costs are incurred outside the scope of any locks, they can run in parallel on different packets. Given that the locked component of manipulating the TCP connection state limits throughput to about 200 Mbits/sec on this platform, we estimate that the TCP/MD5/IP stack would bottleneck at about 16 processors, and that the DES stack would scale to 30 processors. More compute-intensive protocols, such as RSA, should scale linearly as well within this range.
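As a consistency check, these latencies follow from the uniprocessor throughputs reported above. Taking the roughly 138 Mbits/sec send-side TCP/IP figure as the uniprocessor value for 4 KB packets:

$$\mathrm{latency} \approx \frac{\text{packet size}}{\text{uniprocessor throughput}} = \frac{4096 \times 8\ \text{bits}}{138\ \text{Mbits/sec}} \approx 237\ \mu\text{sec},$$

which agrees with the 236 µsec send-side TCP/IP entry in Table 3.1.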
[Figure 3.5: CLP Send Throughputs. Aggregate throughput (Mbits/sec) vs. number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/MD5/IP stacks.]

[Figure 3.6: CLP Send Speedup. Speedup vs. number of processors for the same stacks, with ideal linear speedup shown for comparison.]
We also ran similarly configured UDP-based stacks. In general, the results were similar, except that single-connection parallelism with the baseline UDP stacks exhibited much better speedup than the single-connection baseline TCP stacks. However, as cryptographic processing is used, the differences in both throughput and speedup between TCP-based and UDP-based stacks essentially disappear.
3.7 Connection-Level Parallelism

Figures 3.5 and 3.6 show send-side throughput and speedup, respectively, for our connection-level parallel (CLP) experiments. In these experiments, 12 connections were measured, and the number of processors was varied from 1 to 12. The throughput graphs here show aggregate throughput for all 12 connections. As with packet-level parallelism, the throughputs decline as more compute-intensive cryptographic operations are performed. Previous work [123] has shown good speedup for CLP, barring any other protocol processing. Figure 3.6 illustrates this in the baseline case of the TCP/IP stack. CLP exhibits good speedup when there are at least as many active connections as processors, since each connection is processed in parallel, and little interaction occurs between processors. Figure 3.6 also shows that, when cryptographic processing is added to the scenario, speedup becomes essentially linear. This is because speedup depends on the ratio of compute-bound to memory-bound processing: the more compute-bound an application is, the better the speedup will be; the more memory-bound it is, the greater the likelihood that contention between processors will limit speedup. Since cryptographic protocols are compute-intensive, stacks using them exhibit better speedup.
[Figure 3.7: CLP Receive Throughputs. Aggregate throughput (Mbits/sec) vs. number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/MD5/IP stacks.]

[Figure 3.8: CLP Receive Speedup. Speedup vs. number of processors for the same stacks, with ideal linear speedup shown for comparison.]
Table 3.2 CLP Latency Breakdown (µsec)

Protocol Stack    Send Side  Inc. Cost  Recv Side  Inc. Cost
TCP/IP                  226                    214
TCP/MD5/IP             1772       1546       1769       1555
DES/TCP/IP             7332       7106       7365       7151
Our receive-side experiments, shown in Figures 3.7 and 3.8, display the same trends. Table 3.2 gives the corresponding latency breakdown for the connection-level parallel stacks. The incremental cost for MD5 in this case is about 1550 µsec, and for DES about 7100 µsec.
An interesting result is that while the two approaches to parallelism behave very differently in the baseline cases (namely, the standard TCP/IP stacks), as more cryptographic processing is done, the schemes appear more similar in terms of both throughput and latency. For example, in the send-side experiment using DES, both schemes have a throughput of roughly 52 Mbits/sec with 12 processors. Again, this is because the cryptographic processing, which is similar in the two schemes, vastly outweighs the locking and context-switching costs, which make up the difference between the two approaches.
3.8 Discussion

In this section we have shown how both packet-level and connection-level parallelism can be used to improve cryptographic protocol performance. We have not addressed functional parallelism or more fine-grained approaches to parallelized cryptography, such as using multiple processors to encrypt a single message in parallel. Such an approach would not only improve throughput, but might also reduce the latency as seen by an individual message. For example, DES in Electronic Code Book (ECB) mode can be run in parallel on different blocks of a single message. However, DES using ECB is susceptible to simple-substitution code attacks and cut-and-paste forgery, both of which are realistic worries in computer systems which send large amounts of known text. Thus, most DES implementations use CBC mode, where a plaintext block is XOR'ed with the ciphertext of the previous block, making each block dependent on the previous one, and preventing a parallelized implementation. However, each 8-byte block of a message encrypted with DES in CBC mode could be decrypted in parallel, since computing the plaintext block requires only the key, the ciphertext block, and the previous ciphertext block.
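In the usual notation, with $E_K$ and $D_K$ denoting DES encryption and decryption under key $K$, $P_i$ and $C_i$ the $i$-th plaintext and ciphertext blocks, and $C_0$ the initialization vector, CBC mode is

$$C_i = E_K(P_i \oplus C_{i-1}), \qquad P_i = D_K(C_i) \oplus C_{i-1}.$$

Encrypting block $i$ requires $C_{i-1}$ and so must proceed serially, whereas decrypting block $i$ needs only $C_i$ and $C_{i-1}$, both already present in the received message, so all blocks can be decrypted in parallel.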
In practice, DES CBC must be used with some form of message integrity check to thwart cut-and-paste forgeries. MD5 is not amenable to fine-grain parallelism, and this limits the opportunities for applying our methods. Some avenues for research include finding faster or parallelizable message integrity algorithms, and combining these with DES modes that allow finer-grain parallel encryption techniques, especially modes that allow the sender and receiver to use different processing granularities. An interesting question is whether an algorithm could be designed that could be run in parallel yet still have sufficient cryptographic strength. Of course, parallelism is equally useful to an attacker, who can use multiple processors to speed a brute-force cracking attempt.
3.9 Conclusions

We briefly summarize our findings as follows:
- Parallelism is an effective means of improving cryptographic performance, using both packet-level parallelism and connection-level parallelism.

- Under both approaches, relative throughput declines as more compute-intensive protocols are used. On the other hand, speedup relative to the uniprocessor case improves.

- In packet-level parallelism, speedup is essentially linear when DES or any more compute-intensive protocol is used.

- In connection-level parallelism, speedup is essentially linear when MD5 or any more compute-intensive protocol is used.

Both packet-level and connection-level parallelism are appropriate vehicles for servers that transfer large amounts of data securely. Due to the compute-bound
nature of cryptographic protocols, we observe good scalability for parallelized network security protocols, using both packet-level and connection-level parallelism.
CHAPTER 4
CACHE BEHAVIOR OF NETWORK PROTOCOLS

4.1 Introduction

Cache behavior is a central issue in contemporary computer system performance. It is well known that there is a large gap between CPU speeds and memory speeds, which is expected to continue for the foreseeable future [51]. Cache memories are used to bridge this gap, and multiple levels of cache memory have become typical. Many studies have examined the memory reference behavior of application code, and recently work has appeared studying the cache behavior of operating systems. However, little work has been done to date exploring the impact of memory reference behavior on network protocols. Given the presence of caches on both uniprocessor and multiprocessor servers, it is important to understand the cache behavior of network protocols. Thus, rather than examining an application suite such as the SPEC 95 benchmarks, the workload that we study is network protocol software. We address the following research issues:
What is the memory reference behavior of network protocol code? What are the cache hit rates? How much time is spent waiting for memory?
What instructions do network protocols use? What percentage of time is taken by different classes of instructions?
How does data copying, which is frequently cited as a performance factor in network protocol processing, affect cache behavior?
Which has a more significant impact on performance, instruction references or data references?
How sensitive are network protocols to the cache organization? What are the impacts of factors such as cache size and associativity on performance?
What kind of impact will future architectural trends have on network protocol performance?

We use execution-driven simulation to address these questions, using an actual network protocol implementation that we run both on a real system and on a simulator. We have constructed a simulator for our MIPS R4400-based Silicon Graphics machines, and have taken great effort to validate it, i.e., to make sure that it models the performance costs of our platform accurately. We use the simulator to analyze a suite of Internet-based protocol stacks implemented in the x-kernel [58], which we ported to user space on our SGI machine. We characterize the behavior of network protocol processing, deriving statistics such as cache miss rates, instruction usage, and the percentage of time spent waiting for memory. We also determine how sensitive protocol processing is to the architectural environment, varying factors such as cache size and associativity, and we predict performance on future machines.

We show that network protocol software is very sensitive to cache behavior, and quantify this sensitivity in terms of performance under various conditions. We find that protocol memory reference behavior varies widely, and that instruction cache behavior has the greatest influence on protocol performance in most scenarios. We show the upper bounds on performance that can be expected by improving memory behavior, and the impact of features such as associativity and larger cache sizes. In particular, we find that TCP is more sensitive to cache behavior than UDP, gaining larger benefits from improved associativity and bigger caches. We predict that network protocol performance will scale with CPU speed over time, except with large packets on protocol architectures that copy data.

The remainder of this chapter is organized as follows: In Section 4.2 we outline related work. In Section 4.3 we describe our experimental environment, including the protocols and the execution-driven simulator. In Section 4.4 we analyze a set of network protocol stacks. In Section 4.5 we show how sensitive network protocols are to architectural features such as cache size and associativity. In Section 4.6 we give an example of improving instruction cache behavior. In Section 4.7 we summarize our conclusions and discuss possible future work.
4.2 Related Work

A number of researchers have addressed related issues in network protocol performance, involving architecture and memory system performance. In this section we outline their results and, as appropriate, relate their findings to ours.
4.2.1 Closely Related Work

Blackwell [13] also identifies instruction cache behavior as an important performance factor, using traces of NetBSD on an Alpha. He proposes a technique for improving processing times for small messages, processing batches of packets at each layer so as to maximize instruction cache reuse, and evaluates this technique via a simulation model of protocol processing.

Clark et al. [31] provide an analysis of TCP processing overheads on an Intel i386 architecture circa 1988. Their analysis focuses on protocol-related processing, and does not address OS issues such as buffering and copying data. Their argument is that TCP can support high bandwidths if implemented efficiently, and that the major sources of overhead are in data-touching operations such as copying and checksumming. They also note that the instruction use of the protocols was essentially unchanged when moving to an unspecified RISC architecture, and that this set is essentially a RISC set. They also focus on data memory references, assuming that instructions are in the cache. We have also focused on protocol-related issues, but on a contemporary RISC architecture, and have quantified the instruction usage. We have examined both instruction and data references, measured cache miss rates for both, and have explored the range of cache behavior.

Jacobson [61] presents a high-performance TCP implementation that tries to minimize data memory references. He shows that by combining the packet checksum with the data copy, the checksum incurs little additional overhead since it is hidden in the memory latency of the copy. We have measured the cache miss rates of protocol stacks with and without both the copy and the checksum.

Mosberger et al. [83] examine several compiler-related approaches to improving protocol latency. They present an updated study of protocol processing on a DEC Alpha, including a detailed analysis of instruction cache effectiveness. Using a combination of their techniques (outlining, cloning, and path-inlining), they show up to a 40 percent reduction in protocol processing times.

Rosenblum et al. [104] present an execution-driven simulator that executes both application and operating system code. They evaluate scientific, engineering, and software development workloads on their simulator. They conclude that emerging architectural features such as lockup-free caches, speculative execution, and out-of-order execution will maintain the current imbalance of CPU and memory speeds on uniprocessors. However, these techniques will not have the same effect on shared-memory multiprocessors, and they claim that CPU/memory disparities will become even worse on future multiprocessors. Our workload, in contrast, is network protocol processing, and we have only examined uniprocessor behavior. Although we cannot evaluate some of the more advanced architectural features that they do, our conclusions about our workload on future architectures agree with theirs, due to the increased cache sizes and associativities that are predicted for these machines.

Salehi et al. [109] examine scheduling for parallelized network protocol processing via a simulation model parameterized by measurements of a UDP/IP protocol stack on a shared-memory multiprocessor. They find that scheduling for cache affinity can reduce protocol processing latency and improve throughput. Rather than using a model of protocol behavior, we use real protocols to drive a validated execution-driven simulator. We examine both TCP and UDP, determine both instruction and memory costs, and vary architectural dimensions to determine sensitivity.

Speer et al. [115] describe profile-based optimization (PBO), which uses profiles of previous executions of a program to determine how to reorganize code to reduce branch latencies and, to a lesser extent, cache misses. PBO reorders basic blocks to improve branch prediction accuracy, and reorders procedures so that the most frequent call chains are laid out contiguously to reduce instruction cache misses. They show that PBO can improve networking performance by up to 35 percent on an HP PA-RISC architecture when sending single-byte packets. Our work, in contrast, separates the benefits of branch prediction from code repositioning, and shows that the latter has at least as much of an effect as the former.

Much research has been done supporting high-speed network interfaces, both in the kernel and in user space [8, 14, 15, 33, 38, 39, 41, 80]. A common theme throughout this body of work is the desire to reduce the number of data copies as much as possible, as naive network protocol implementations can copy packet data as many as five times. As a consequence, single-copy and even "zero-copy" protocol stacks have been demonstrated [28, 84]. This work focuses on reducing the work done during protocol processing, namely reducing the number of instructions executed. Our protocol stacks emulate zero-copy and single-copy stacks. Our results not only measure the cache miss rates and determine the architectural sensitivity, but also distinguish between instruction memory references and data memory references.
4.2.2 Less Closely Related Work

Clark and Tennenhouse [32] propose Integrated Layer Processing (ILP) as a means to improve network protocol performance. ILP is designed to reduce memory traffic by loading packet data into registers or cache memory and performing all protocol manipulation of that data before storing it back to main memory, rather than loading and storing the data at each layer. Braun and Diot [19] present a working implementation of Integrated Layer Processing. They conclude that the benefits of ILP in a complete protocol environment are smaller than in simple standalone experiments, showing performance improvements in throughput and latency in the 10-20 percent range, rather than the 50 percent often quoted for simple loops. They show that the main benefits of ILP are reduced data memory accesses rather than improved cache hit rates. Our work does not relate to ILP directly, but supports the conclusions of Braun and Diot by illustrating the importance of instruction cache behavior.

Ousterhout [92] shows that operating systems do not benefit as much as applications from faster hardware. He attributes this to the OS's larger use of both memory and disk bandwidth. Anderson et al. [4] look at the interaction between operating systems and computer architecture. They find that many RISC processors do not provide efficient architectural support for certain OS primitives such as context switches, system calls, and page table changes. They observe that, at the same time, many operating systems are making heavier use of those same primitives.
Chen and Bershad [27] present an environment for tracing operating system kernel memory references. They show that operating systems with a decomposed structure (i.e., "microkernels") exhibit worse cache behavior than traditional monolithic kernels.
4.3 Experimental Infrastructure

In this section we describe our architectural simulator, discuss our experimental environment, and present validation results.
4.3.1 Architectural Simulator

In order to investigate the memory reference behavior of network protocols, we have designed and implemented an architectural simulator for our 100 MHz R4400-based SGI machine. We use this simulator to understand the performance costs of our network protocol stacks, and to guide us in identifying and reducing bottlenecks. The primary goal of the simulator has been to accurately model CPU and memory costs for the SGI architecture.

Our architectural simulator is built using MINT [122], a toolkit for implementing multiprocessor memory reference simulators. MINT interprets a compiled binary directly and executes it, albeit much more slowly than if the binary were run on the native machine. This process is called direct execution. MINT is designed to simulate MIPS-based multiprocessors, such as our SGI machine, and has support for the multiprocessor features of IRIX. Unlike some other simulators, it does not require all the application source to be available, and does not require changing the application for use in the simulator. This means the exact same binary is used on both the actual machine and in the simulator.

A simulator built using MINT consists of two components: a front end, provided by MINT, which handles the interpretation and execution of the binary, and a back end, supplied by the user, that maintains the state of the cache and provides the timing
properties that are used to emulate a target architecture. The front end is usually called a trace generator, and the back end a trace consumer. On each memory reference, the front end invokes the back end, passing the appropriate memory address. Based on its internal state, the back end returns a value to the front end telling it whether to continue (for example, on a cache hit) or to stall (on a cache miss).

We have designed and implemented a back end for use with MINT to construct a uniprocessor simulator for our 100 MHz R4400-based SGI Challenge. Figure 4.1 shows the memory organization for this machine. The R4400 has separate 16 KB direct-mapped on-chip first-level instruction and data caches with a line size of 16 bytes. Our SGI machine also has a 1 MB second-level direct-mapped on-board unified cache with a line size of 128 bytes.

Figure 4.1 Machine Organization (CPU with split first-level instruction and data caches, a unified second-level cache, and main memory and the network I/O board attached to the memory bus)

The simulator captures the cost of the important performance characteristics of the SGI platform. It supports multiple levels of cache hierarchy, including the inclusion property for multi-level caches, and models the aspects of the MIPS R4400 processor that have a statistically significant impact on performance, such as branch delays and load delay pipeline interlocks. It does not, however, capture TLB behavior. (An earlier version of our simulator did model the TLB, but we found that the impact on accuracy was negligible, while the execution time of the simulator tripled.)
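As an illustration of what such a trace-consumer back end maintains, the following sketch models one direct-mapped cache level with the R4400 L1 parameters (16 KB, 16-byte lines) and returns the stall time to charge for each reference. The interface is our own simplification for exposition; it is not MINT's actual API.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE   16                      /* bytes per cache line      */
    #define CACHE_SIZE  (16 * 1024)             /* 16 KB, as in the R4400 L1 */
    #define NUM_LINES   (CACHE_SIZE / LINE_SIZE)

    typedef struct {
        uint32_t tag[NUM_LINES];
        bool     valid[NUM_LINES];
    } dm_cache_t;

    /* Return the stall time (in cycles) for this reference: 0 on a hit,
     * miss_penalty on a miss, which also fills the line. */
    int cache_reference(dm_cache_t *c, uint32_t addr, int miss_penalty)
    {
        uint32_t line  = addr / LINE_SIZE;
        uint32_t index = line % NUM_LINES;
        uint32_t tag   = line / NUM_LINES;

        if (c->valid[index] && c->tag[index] == tag)
            return 0;                  /* hit: continue at full speed */

        c->valid[index] = true;        /* miss: fill the line */
        c->tag[index]   = tag;
        return miss_penalty;
    }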
4.3.2 Network Protocol Workload

The experimental environment we use here is similar to that described in Chapters 2 and 3. The main difference is that a uniprocessor version of the x-kernel is used. This is the uniprocessor base for two different multiprocessor versions of the x-kernel: the PLP implementation described in Chapter 2 and the CLP implementation described in [124]. Another difference is that we focus on average latency rather than throughput as a performance metric. This is because latency is better suited to illustrate the effects that different factors such as the CPU and memory have on performance.

In our experiments, we define latency as the total time the network protocol code requires to process a packet. On the send side, latency is the total time between when a packet is sent at the top of the protocol stack and when the send function returns; it thus includes the procedure call return time after a packet is delivered to the device. Receive-side latency is measured similarly, except that timing starts at the bottom of the protocol stack and stops when the procedure delivering the packet to the top of the stack has returned.

In our experiments, the reported latencies are the average of ten runs, where each run in turn measures the average latency observed over a 5 second sampling interval after a 5 second warmup. During these intervals, other processing can occasionally occur, such as the x-kernel's periodic event manager, which runs every 50 milliseconds, or the TCP 200 millisecond fast timer. However, we have observed the times for these other events to be statistically insignificant in our experiments.
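The measurement loop can be sketched as follows. The stack_send_one_packet() entry point is a stand-in for one send-side pass through the stack; the harness (warmup interval, sampling interval, averaging) mirrors the procedure described above, but is our own illustration rather than the actual x-kernel test driver.

    #include <sys/time.h>

    /* Hypothetical entry point: one send-side pass through the stack. */
    extern void stack_send_one_packet(void);

    static double elapsed_sec(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    /* Warm up for `warmup` seconds, then sample for `sample` seconds and
     * return the average per-packet latency in microseconds. */
    double measure_avg_latency_usec(double warmup, double sample)
    {
        struct timeval t0, now;
        long packets = 0;

        gettimeofday(&t0, 0);                         /* warmup phase */
        do {
            stack_send_one_packet();
            gettimeofday(&now, 0);
        } while (elapsed_sec(t0, now) < warmup);

        gettimeofday(&t0, 0);                         /* measurement phase */
        do {
            stack_send_one_packet();
            packets++;
            gettimeofday(&now, 0);
        } while (elapsed_sec(t0, now) < sample);

        return elapsed_sec(t0, now) * 1e6 / packets;  /* µsec per packet */
    }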
4.3.3 Validating the Simulator

In order to validate the performance accuracy of the simulator, a number of benchmarks were run on both the real and the simulated machine. We used the memory striding benchmarks from LMBench [78] to measure the cache hit and miss latencies for all three levels of the memory hierarchy: L1, L2, and main memory.
Figure 4.2 Actual Read Times (LMBench read latency in nanoseconds, log-log scale, as a function of array size in KB for strides of 4 to 128 bytes, measured on the SGI IP19 with a 100 MHz R4400, 16 KB split first-level caches, and a 1 MB unified second-level cache)

Figure 4.3 Simulated Read Times (the same memory signature produced by running the identical binary on the simulator)
Table 4.1 Read and write times in cycles

  Layer in Hierarchy    Read time    Write time
  L1 Cache                      0             0
  L2 Cache                     11            11
  Challenge Bus               141           147
Table 4.1 lists the cycle times to read and write the caches on the 100 MHz SGI Challenge (IP19). Figure 4.2 shows the LMBench read memory access time as a function of the area walked by the stride benchmark, as run on our 100 MHz R4400 SGI Challenge. We call this graph a memory signature. The memory signature illustrates the access times of the first-level cache, the second-level cache, and main memory. When the area walked by the benchmark fits within the first-level cache (i.e., is 16 KB or less), reading a byte in the area results in a first-level cache hit and takes 20 nanoseconds. When the area fits within the second-level cache (i.e., is between 16 KB and 1 MB in size), reading a byte results in a second-level cache hit and takes 134 nanoseconds. If the area is larger than 1 MB, main memory is accessed, and the time to read a byte is 1440 nanoseconds. Note that the scales in Figure 4.2 are logarithmic in both the x and y axes.

These memory latency measurements in turn gave us values with which to parameterize the architectural simulator. The same memory stride programs were then run in the simulator, to ensure that the simulated numbers agreed with those from the real system. Figure 4.3 shows the memory signature of the same binary being run on the simulator for the same machine. As can be seen, the simulator models the cache memory behavior very closely.

While reassuring, these memory micro-benchmarks do not stress other overheads such as instruction costs. What we are most interested in is how accurate our simulator is on our workload, namely, network protocol processing. Table 4.2 presents a set of protocol processing benchmarks, with their corresponding real and simulated latencies in microseconds, and the relative error.

Table 4.2 Macro benchmark times (µsec) and relative error

  Benchmark                Simulated      Real    Error (%)
  TCP Send, cksum off          76.63     78.58         2.48
  TCP Send, cksum on          147.84    146.66        -0.81
  UDP Send, cksum off          18.43     15.97       -15.40
  UDP Send, cksum on           71.99     70.30        -2.41
  TCP Recv, cksum off          58.06     62.65         7.33
  TCP Recv, cksum on          190.47    198.39         3.99
  UDP Recv, cksum off          33.80     32.84        -2.95
  UDP Recv, cksum on          161.78    158.84        -1.85
  Average Error                                        4.65

Error is defined as
Error = (Simulated Value - Real Value) / Real Value × 100

A negative error means the simulator underestimates the real time; a positive value means it overestimates the real time. The average error is calculated as the mean of the absolute values of the individual errors, to prevent positive and negative individual values from canceling each other out. Note that the average error is under 5 percent, with the worst-case error being about 15 percent. We are aware of only a very few pieces of work using trace-driven or execution-driven simulation that actually validate their simulators [10, 25, 36]; our accuracy is comparable to theirs. More details about the construction and validation of the simulator can be found in the appendix.
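For reference, the core of a striding read benchmark of the kind used above looks roughly like the following. This is a simplified illustration of the technique, not the actual LMBench source.

    #include <stddef.h>

    volatile unsigned char sink;   /* keeps the compiler from removing the loads */

    /* Walk `size` bytes of `buf`, touching one byte every `stride` bytes.
     * When `size` fits in a given cache level, each read hits in that level;
     * timing this loop over many (size, stride) pairs yields the memory
     * signature plotted in Figures 4.2 and 4.3. */
    void stride_read(const unsigned char *buf, size_t size, size_t stride,
                     int iterations)
    {
        for (int it = 0; it < iterations; it++)
            for (size_t i = 0; i < size; i += stride)
                sink = buf[i];
    }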
4.4 Characterization and Analysis

In this section, we present our characterization and analysis of the memory reference behavior of network protocols under a number of different conditions.
4.4.1 Baseline Memory Analysis

We begin by determining the contribution to packet latency that is due to waiting for memory. Our baseline latencies are produced by executing the core suite of protocols on the architectural simulator, examining TCP and UDP, send and receive side, with checksumming on and off. Figure 4.4 shows the latencies in microseconds, distinguishing time spent in computation (CPU time) from time spent waiting for memory (memory time). Figure 4.5 shows the relative percentages of time the protocols spend waiting for memory. Table 4.3 gives the corresponding cache miss rates for the L1 instruction cache, the L1 data cache, and the L2 unified cache for each configuration, where the miss rate is the fraction of accesses that are not in the cache [51].

Studying the data in more detail, we see that all the configurations spend time waiting for memory, ranging from 16 to nearly 57 percent. TCP generally spends a larger fraction of time than UDP waiting for memory, and receivers are slightly more memory-bound than senders. UDP generally exhibits low miss rates for all the cache levels, while TCP tends to have higher miss rates for the corresponding experiments, particularly in the data cache on the send side.
Figure 4.4 Baseline Latencies (latency in µsec for each protocol configuration, split into CPU time and memory time)

Figure 4.5 Baseline Percentages (percentage of each configuration's time spent in the CPU versus waiting for memory)
Table 4.3 Cache Miss Rates for Baseline Protocols

  Protocol Configuration    Level 1 Instr    Level 1 Data    Level 2 Unified
  TCP Send Cksum Off                8.30%           5.90%              0.00%
  TCP Send Cksum On                 3.60%           7.60%              0.00%
  TCP Recv Cksum Off                7.00%           2.80%              0.00%
  TCP Recv Cksum On                 2.80%          15.60%              5.70%
  UDP Send Cksum Off                4.00%           0.30%              6.70%
  UDP Send Cksum On                 1.10%           1.10%              2.50%
  UDP Recv Cksum Off                4.70%           1.60%              2.60%
  UDP Recv Cksum On                 1.50%          17.80%              9.20%

Experiments that include checksumming generally show lower instruction cache miss rates, since the checksum code is an unrolled loop and thus exhibits higher temporal locality. The checksum code also does a good job of hiding memory latency, since the unrolling allows checksum computation to overlap with load instructions.
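The checksum routine referred to here is, in essence, a 16-bit one's-complement sum with an unrolled inner loop. The version below is a simplified illustration (it glosses over byte-order handling of an odd trailing byte), not the actual routine used in the stack.

    #include <stddef.h>
    #include <stdint.h>

    /* Internet checksum with a four-way unrolled inner loop.  The unrolling
     * keeps several independent additions in flight, so the arithmetic
     * overlaps with the loads that miss in the cache. */
    uint16_t inet_cksum(const uint16_t *buf, size_t len_bytes)
    {
        uint32_t sum = 0;
        size_t n = len_bytes / 2;

        while (n >= 4) {              /* unrolled: four 16-bit words per pass */
            sum += buf[0];
            sum += buf[1];
            sum += buf[2];
            sum += buf[3];
            buf += 4;
            n   -= 4;
        }
        while (n--)                   /* leftover words */
            sum += *buf++;
        if (len_bytes & 1)            /* odd trailing byte (simplified) */
            sum += *(const uint8_t *)buf;

        while (sum >> 16)             /* fold carries into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }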
4.4.2 Impact of Copying Data

Our baseline results are for the case where there is no copying of data between buffers or across address spaces. Avoiding or eliminating copies is a well-known technique for improving network protocol performance [8, 14, 15, 33, 38, 41, 80]. In certain cases, however, a copy may be unavoidable, due to insufficient operating system or device support. In this section we evaluate how copying data affects network protocol performance.

Our protocols run in user space, and do not incur any additional overhead beyond the raw protocol processing time. To examine how copying data affects performance, we ran our protocols with a copy introduced at the top of the stack, to capture the cost of mechanisms such as a socket buffer copy across the user/kernel boundary. The latencies are presented in Figure 4.6, again distinguishing CPU time from memory time. Figure 4.7 shows the relative percentages of time the protocols spend waiting for memory. The corresponding cache miss rates are summarized in Table 4.4. In these examples, we used a 4 KB packet size, and provoked worst-case cache behavior by forcing the copy buffer not to be cached. Obviously, whether or not a particular buffer is cached in a particular scenario depends on many factors, such as the application behavior.

Several trends are apparent. We see that latency increases by a factor of 2 to 7, depending on the scenario, mainly due to the increased time spent waiting for memory.
Figure 4.6 Copy Protocol Times (latency in µsec for each configuration with the added copy, split into CPU time and memory time)

Figure 4.7 Copy Protocol % (percentage of each configuration's time spent in the CPU versus waiting for memory)

Table 4.4 Cache Miss Rates for Protocols with Copying

  Protocol Configuration    Level 1 Instr    Level 1 Data    Level 2 Unified
  TCP Send Cksum Off                4.10%          15.60%              3.40%
  TCP Send Cksum On                 2.60%          13.20%              3.00%
  TCP Recv Cksum Off                3.40%          20.10%              5.80%
  TCP Recv Cksum On                 2.20%          16.60%              5.40%
  UDP Send Cksum Off                0.80%          14.70%              7.50%
  UDP Send Cksum On                 0.70%          10.20%              6.90%
  UDP Recv Cksum Off                1.70%          21.50%              7.20%
  UDP Recv Cksum On                 1.10%          16.60%              6.90%
In all cases, configurations with a copy have better instruction cache miss rates than those without the copy (due to the use of the bcopy() routine), and have worse data cache miss rates (due to buffers not being present in the cache). The time to copy data can approach twice that of all TCP protocol processing, and up to 6 times that of UDP!

We stress that these are worst-case scenarios. These experiments coerce poor cache behavior, which normally depends on application as well as protocol behavior, particularly on how applications manage data. These scenarios also use large packet sizes, and researchers have observed that average packet sizes are small (see, for example, [13, 65]). In one send-side TCP experiment that used 64-byte packets, for example, introducing a copy led to only a 9 percent increase in latency. Our experiments that use copies should thus be considered worst-case scenarios for this particular platform. For the remainder of this chapter, unless otherwise specified, all results are for the protocol stacks without the added copy.
4.4.3 Hot vs. Cold Caches

The experiments we have presented thus far have involved hot caches, where successive packets benefit from the state left behind by their predecessors. However, the state of the cache can vary depending on application and operating system behavior. For example, when a packet arrives at an idle machine, it is not certain whether the network protocol code or data will be cache-resident. To examine the impact of cache state on network protocol performance, we ran additional experiments that measure the extremes of cache behavior, using cold caches and idealized caches.

In experiments with cold caches, the simulated cache is flushed of all contents after each packet is processed (this flush takes zero cycles of simulated time). Each packet is thus processed with no locality benefits from the previous packet. Cold cache experiments thus measure the potential worst-case memory behavior of protocol processing. In experiments with idealized caches, the assumption is made that all references to the level 1 caches hit; i.e., no gap between memory and CPU speeds exists. However, the processor model remains the same, as described in Section 4.3.1. The idealized cache experiments thus give us an unrealizable best-case behavior, and provide upper bounds on performance when focusing solely on the memory behavior of network protocols.

Table 4.5 gives a sample of results, in this case for the UDP and TCP send sides with and without copying.

Table 4.5 Latencies (in µsec) with Cold, Hot, and Idealized Caches

  Protocol Configuration           Cold    Hot    Ideal
  TCP Send Cksum Off                375     77       42
  TCP Send Cksum Off w/COPY         549    201       74
  UDP Send Cksum Off                123     18       12
  UDP Send Cksum Off w/COPY         301    132       45

In general, we observe a factor of 5 to 6 increase in latency between experiments with hot caches and those with cold caches. Our experiments using UDP exhibit a slowdown by a factor of 6, which is similar to the measurements of Salehi et al. [109], who observed a slowdown of a factor of 4 when coercing cold-cache behavior with UDP without checksumming. In experiments using TCP, which they did not examine, we see that latencies increase by a factor of 5. This is because TCP exhibits relatively better instruction cache miss rates than UDP in the cold cache scenario. Since TCP does more instruction processing per packet than UDP, TCP benefits from the implicit instruction prefetching due to line sizes greater than one word, which improves its instruction cache miss rate relative to UDP. In other words, TCP exhibits better spatial locality.
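In terms of the illustrative cache model sketched at the end of Section 4.3.1, coercing cold-cache behavior amounts to invalidating every line between packets; a minimal version (reusing that sketch's hypothetical dm_cache_t) is:

    /* Invalidate all lines so the next packet sees no residual state.
     * The flush itself is charged zero simulated cycles. */
    void cache_flush(dm_cache_t *c)
    {
        for (int i = 0; i < NUM_LINES; i++)
            c->valid[i] = false;
    }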
Table 4.6 Cold Cache Miss Rates

  Protocol Configuration           Level 1 Instr    Level 1 Data    Level 2 Unified
  TCP Send Cksum Off                      20.90%          18.50%             21.30%
  TCP Send Cksum Off w/Copy               10.40%          22.70%             16.80%
  UDP Send Cksum Off                      23.10%          19.90%             28.60%
  UDP Send Cksum Off w/Copy                4.90%          24.40%             16.20%
We also saw that cold cache experiments using checksumming did not suffer as much relative slowdown, compared with their hot-cache equivalents, as those that did not use checksumming. This was due to better instruction cache behavior, again because the checksum code exhibits both high temporal and spatial locality. Interestingly, in cold cache experiments that used protocols with copies added, the relative slowdown due to the copy was only about 50 percent. This is because adding a copy usually provokes poor cache behavior, but in the cold-cache experiment, cache behavior is already worst-case; adding a copy therefore makes a relatively smaller difference in latency than it does in the warm-cache case. Table 4.6 presents the cache miss rates for a sample of cold cache experiments. In general, the miss rate goes up by a factor of 2-5. It is interesting to note that despite the initial cold state of the caches, miss rates are still under 30 percent.
4.4.4 Instructions vs. Data

As discussed in Section 4.4.2, much of the literature on network protocol performance has focused on reducing the number of copies, since touching the data is expensive. However, this work has not made explicit how much of this cost is due to data references as opposed to instruction references. For example, a copy routine on a RISC platform incurs at least one instruction reference for every data reference, namely, the instruction that loads or stores a piece of data. We therefore want to understand which cost has a larger impact on performance: instruction references or data references.

To test which type of reference is more significant, we implemented two additional simulators. The first was an ideal d-cache simulator, where data references always hit in the L1 data cache, but instructions are fetched normally from the instruction cache; thus, there are no data cache misses. The second was a complementary ideal i-cache simulator, where there are no instruction cache misses, but data references are fetched normally from the d-cache. While neither of these simulators represents an attainable machine, they do provide a way of distinguishing instruction costs from data costs, and yield insight into where network protocol developers should focus their efforts for the greatest benefit.

In all of our experiments, the raw number of instruction references exceeds that of data references. In general, tests without checksumming exhibit a roughly 2:1 instruction-to-data ratio, and experiments with checksumming have a 3:1 ratio. This is consistent between TCP and UDP, and between the send side and the receive side. In most of the experiments, instruction references also outweigh data references in terms of their impact on latency. The exception is for protocol architectures that copy data where packets are large; in these experiments, the d-cache was more significant than the i-cache. Figure 4.8 gives an example of the results, for the TCP and UDP send sides. The columns marked 'D-Cache' are the times for the idealized data cache, and the columns marked 'I-Cache' contain results using the idealized instruction cache. Our results indicate that the performance impact on network protocols of the CPU-memory disparity is felt more through instruction references than through data references.
Figure 4.8 Send Side Latencies (baseline, idealized D-cache, idealized I-cache, and fully idealized latencies in µsec for the TCP and UDP send sides, with and without checksumming)

In another test, we tried to provoke pathologically bad data cache behavior by varying the number of active connections from 1 to 256. Throughout the experiment, connections are serviced round-robin, and push different buffers through the stack. Figure 4.9 gives an example of a send-side TCP test with checksumming on, and distinguishes CPU time from memory time. As can be seen, latency goes up slightly as a function of the number of active connections, with all the additional overhead coming from worse data-cache behavior. However, the overhead is not as large as might be expected. This is, again, because instruction memory referencing behavior has a larger impact on performance.

Figure 4.9 TCP Send Side Latencies (latency in µsec, split into CPU time and memory time, as the number of active connections is varied from 1 to 256)
4.4.5 Instruction Usage

Thus far, we have examined how memory issues affect network protocol latency. However, given that in some of our experiments up to 68 percent of the cycles can be attributed to factors other than the memory system (i.e., to the CPU), it is worthwhile to examine in detail which instructions network protocol software uses and how much those instructions contribute to the overall latency.
This will also give us indications about the efficacy of possible optimizations, such as, for example, improved branch prediction.

Figure 4.10 shows the instruction usage for each protocol benchmark, listing the percentage of instructions used by category. All of our test cases used essentially five types of instructions: loads, stores, controls, arithmetics, and no-ops. Control instructions are those that can change the control flow of a program, i.e., both conditional control (conditional branches) and absolute control (jumps). On the MIPS architecture, no-ops are inserted after a branch or load instruction if the delay slot
cannot be filled with an instruction that does useful work. Arithmetic instructions are those that perform some sort of arithmetic operation on registers, including adds, subtracts, shifts, and logicals. Although all our test cases used these classes of instructions, their relative proportions varied depending on the configuration.

Figure 4.10 Instruction Usage (percentage of instructions by class: loads, stores, control, no-ops, and arithmetic, for each protocol configuration)

Figure 4.11 details where time is spent, as the percentage of cycles consumed by each instruction class and by waiting for memory. Figure 4.12 presents the latencies of the benchmarks, broken down in the same way. Note that the percentage of instructions in a class is not necessarily related to the percentage of the latency that those instructions contribute. For example, in the TCP send-side test with checksumming, arithmetic operations make up about 58 percent of the instructions executed, but account for only 32 percent of the total time.

One interesting observation is that instructions changing control flow (branches and jumps) account for between 4 and 19 percent of the total time used. If no-ops are included as control instructions, this percentage rises to the 8-23 percent range. This implies that, even if we were to have single-cycle control instructions (or perfect branch prediction), the improvement in latency would be limited to a fraction of that. For example, in the UDP receive-side test, control instructions constitute 16 percent
of the instructions (26 percent including no-ops), and 19 percent of the total time in cycles (23 percent including no-ops). Since an idealized branch instruction would still cost one cycle, control instructions would in this example still contribute roughly 10 percent of the total cycles, so the improvement in latency would at best be between nine and fourteen percent.

Figure 4.11 Percentage of Cycles (percentage of cycles spent in each instruction class and waiting for memory, for each protocol configuration)

Figure 4.12 Latencies with Cycle Breakdown (latency in µsec broken down by instruction class and memory time, for each protocol configuration)

This is in contrast to the results of Speer et al., who observed a 35 percent improvement in network protocol latency by using profile-based optimization. Their paper claims that this performance gain is mostly due to improved branch prediction and, to a lesser extent, better instruction cache behavior. Our findings, however, indicate that the change in instruction cache behavior is a significant contributing factor in their results. We evaluate the use of profile-based optimization for code placement in Section 4.6; with it, we observe performance improvements of up to 40 percent. We should note, however, that Speer et al. used a different architecture (an HP PA-RISC) with a different networking stack (BSD-based), so a direct comparison of results is not possible. Obviously, for architectures with worse branch penalties, accurate branch prediction methods will result in correspondingly better performance improvements.
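The limit on what perfect branch prediction could buy can be written out explicitly. Using the UDP receive-side numbers above, with f_cycles the fraction of cycles currently spent in control instructions and f_ideal the roughly 10 percent they would still consume at one cycle each (the chapter's own estimate), the best-case saving is

    \Delta_{\max} \;=\; f_{\text{cycles}} - f_{\text{ideal}}
      \;\approx\; 0.19 - 0.10 \;=\; 0.09,
    \qquad
    \Delta_{\max}^{\text{(no-ops included)}} \;\approx\; 0.23 - 0.10 \;=\; 0.13,

which brackets the nine-to-fourteen percent range quoted above (the exact upper figure depends on how many delay-slot no-ops perfect prediction would remove).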
Hennessy and Patterson [51] note that as CPUs issue more instructions each cycle, branch performance becomes even more important.
4.5 Architectural Sensitivity

In this section we explore architectural variations along several dimensions, in order to determine how sensitive protocol performance is to the host architecture, and how protocol performance might be expected to change with the introduction of new architectures.
4.5.1 Increased Cache Size

One trend in emerging processors is increasing transistor counts, which has led to increasing on-chip cache sizes. For example, the MIPS R10000 has 32 KB on-chip caches. As we described earlier, our SGI platform has 16 KB first-level caches. While in certain cases (typically the UDP experiments) this gives reasonable miss rates, it is useful to see how sensitive network protocol performance is to the size of the cache. Larger caches, for example, may allow the entire working set of the protocol stack to fit completely in the cache.

To evaluate how sensitive network protocol performance is to cache size, we ran a set of experiments varying the first-level cache sizes from 8 KB up to 128 KB in powers of two. The level 2 unified cache was left at 1 MB. Figure 4.13 presents the latencies for our protocol configurations as a function of the first-level cache size. We can see that increased cache size results in reduced latency, and that TCP is more sensitive to the cache size than UDP. The largest gains come from increasing the level 1 cache sizes up to 32 KB, with diminishing improvements after that. Figure 4.14 gives an example in detail, showing TCP send-side latency with checksumming off as a function of cache size, again distinguishing between CPU time and memory time.
Figure 4.13 Latencies with Increasing Cache Size (latency in µsec for each protocol configuration with first-level caches of 8 KB to 128 KB)

Figure 4.14 TCP Send Side Latency with Larger Caches (CPU time and memory time for the TCP send side, checksumming off, as the first-level cache size grows from 8 KB to 128 KB)

Table 4.7 Miss Rates vs. Cache Size (TCP Send, Cksum Off)

  Cache Size    Level 1 Instr    Level 1 Data    Level 2 Unified
  8 KB                 14.40%           7.10%              0.00%
  16 KB                 8.30%           5.90%              0.00%
  32 KB                 4.10%           1.90%              0.00%
  64 KB                 1.90%           1.50%              0.00%
  128 KB                0.20%           0.80%              0.00%
We can see that the reduction in latency is due to less time spent waiting for memory. Table 4.7 presents the corresponding cache miss rates. We see that both the instruction and data cache miss rates improve as the size increases to 128 KB, but that the change in the instruction cache miss rate is more dramatic.
4.5.2 Increased Cache Associativity

As mentioned earlier, the block placement policy for the caches in our SGI machine is direct-mapped, for both the first and second levels. This means that an item loaded into the cache can be placed in only a single location, usually determined by a subset of its virtual address bits. If two "hot" items happen to map to the same location, the cache will thrash pathologically. In contrast, TLBs and virtual memory systems are usually fully associative. Cache memories have historically been direct-mapped because adding associativity has tended to increase the critical path length and thus increase cycle time [54].

While most machines today have direct-mapped caches, emerging machines, such as the MIPS R10000, are starting to have 2-way set-associative on-chip caches. It is thus useful to assess the impact of improved associativity on network protocol performance.
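To see how such conflicts arise, recall how a direct-mapped cache chooses a set. For the 16 KB, 16-byte-line L1 caches used here (1024 sets), the set selected by address a is

    \text{set}(a) \;=\; \Big\lfloor \frac{a}{\text{line size}} \Big\rfloor \bmod \frac{\text{cache size}}{\text{line size}}
    \;=\; \Big\lfloor \frac{a}{16} \Big\rfloor \bmod 1024,

so any two addresses that differ by a multiple of 16 KB (the cache size) select the same set and evict each other on every access. Two hot procedures, or a hot buffer and a hot code path in the unified L2, laid out at such distances will conflict no matter how small their combined footprint is. (This worked example is ours; the parameters are the L1 parameters given in Section 4.3.1.)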
To test this, we ran several experiments varying the associativity from 1 to 8 in powers of 2. While 8-way set-associative on-chip caches are unusual (the PowerPC 620, for example, has 8-way set-associative on-chip caches), this range is illustrative of how memory time is being spent. For example, it allows us to estimate how much of memory time is due to conflicts in the cache rather than to capacity problems [55].

Figure 4.15 Protocol Latencies with Associativity (latency in µsec for each protocol configuration with 1-, 2-, 4-, and 8-way set-associative caches)

Figure 4.15 presents the protocol latencies as associativity is varied. In these experiments, for simplicity, all caches in the system have the same associativity; e.g., an experiment marked 2 indicates that the instruction cache, the data cache, and the level 2 unified cache all have 2-way set associativity. We can see that TCP exhibits better latency as associativity is increased all the way up to 8. Figure 4.16 gives an example in detail, showing TCP send-side latency with checksumming off as a function of set associativity, again distinguishing between CPU time and memory time. We can see that the reduction in latency is due to a decrease in the time spent waiting for memory.
Figure 4.16 TCP Send Side Latency with Associativity (CPU time and memory time for the TCP send side, checksumming off, at associativities of 1, 2, 4, and 8)

Table 4.8 TCP Miss Rates vs. Associativity (Cksum Off)

  Associativity    Level 1 Instr    Level 1 Data    Level 2 Unified
  1                        8.30%           5.90%              0.00%
  2                        5.30%           0.40%              0.00%
  4                        1.70%           0.00%              0.00%
  8                        0.50%           0.00%              0.00%
Table 4.8 presents the corresponding cache miss rates. We see that the data cache achieves close to zero misses with 2-way set associativity, but that the instruction cache miss rates improve all the way up to 8-way. This implies that the Berkeley-derived TCP code has conflicts on the fast path, and that restructuring the code for better cache behavior promises performance improvements; we present one example of this restructuring in Section 4.6. In contrast, we do not observe any performance gains for UDP beyond 2-way set associativity. This is because with 2-way set-associative caches, the UDP stacks achieve very close to zero misses, i.e., they fit completely within the cache. This shows that UDP has fewer conflicts than TCP, and implies that the opportunity for improving UDP cache behavior is smaller than that for TCP.
Table 4.9 Machine Characteristics

  Machine (year)                    1994       1996       1998
  Clock Speed (MHz)                  100        200        400
  L1 Cache Size (KB)                  16         32         64
  L1 Associativity                     1          2          2
  L1 Rd/Wr time (cycles)             0/0        0/0        0/0
  L2 Cache Size (KB)                1024       1024       1024
  L2 Associativity                     1          2          2
  L2 Rd/Wr time (cycles)           11/11      13/13      16/16
  Memory Rd/Wr time (cycles)     141/147    201/275    300/400
4.5.3 Future Architectures

Given that CPUs are roughly doubling in performance every two years, we would like to gain an understanding of how future architectural trends will impact network protocol performance. We have seen that both increased associativity and larger cache sizes improve latency. However, these previous experiments held the clock speed and miss penalties constant, which ignores two significant trends in computer architecture: first, processor clock speeds are increasing, and second, the gap between memory speeds and CPU speeds is growing.

To gain a better understanding of how network protocol workloads might behave on future architectures, we compared the performance of our stacks on three different virtual machines, representing characteristics of 1994, 1996, and 1998, respectively. The 1994 machine is our baseline case, described earlier. The 1996 machine has a faster clock and larger on-chip caches with 2-way set associativity. It also has larger miss penalties, the values of which we take from the newer SGI IP22 Indigo/2 workstations with a 200 MHz clock. The 1998 machine is an extrapolation of the cycle time and latency trends from the 1994 to the 1996 machine. Table 4.9 gives the relevant parameters for the three machines.
Table 4.10 gives the latencies of several protocol configurations on the three machines. As can be seen, latencies fall as CPUs get faster. However, the real issue is how network protocol processing scales with processor speed. For that answer, we must normalize by the clock speed and look at the cycles per instruction, or CPI. CPI is a standard measure of architectural performance; an idealized (single-issue) architecture will have a CPI of one. Table 4.11 gives the corresponding CPIs for the same set of experiments. In general, we see that the CPI falls as processors get faster. This is because the workload starts to fit within the caches and run at essentially the processor speed. The exception is for protocol stacks that copy buffers using large messages. In these cases, the CPI rises, indicating that protocol processing time is not keeping up with CPU speed. Even in the copy experiments using relatively small 64-byte messages, we see that CPI still increases slightly over time. This means that avoiding copies will become even more important on future machines. Several cold cache experiments are also listed in Tables 4.10 and 4.11. We see that the penalty for a cold cache becomes even worse on future machines.
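The normalization is simply cycles divided by instructions; since the same binary runs on each virtual machine, the dynamic instruction count is fixed and only the cycle count changes. Taking the 1994 TCP send row of Tables 4.10 and 4.11 as a worked example (the implied instruction count is our own back-calculation):

    \mathrm{CPI} \;=\; \frac{\text{cycles}}{\text{instructions}}
               \;=\; \frac{\text{latency} \times f_{\text{clock}}}{\text{instructions}},
    \qquad
    76.61\,\mu\mathrm{s} \times 100\,\mathrm{MHz} = 7661 \text{ cycles},
    \quad \frac{7661}{2.57} \approx 3000 \text{ instructions}.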
Table 4.10 Machine Latencies (µsec)

  Protocol Configuration               1994      1996      1998
  TCP Send COPY OFF Size 4096         76.61     23.91      8.83
  UDP Send COPY OFF Size 4096         18.43      6.26      2.50
  TCP Send COPY ON Size 16            22.98      9.06      3.29
  TCP Send COPY ON Size 64            20.99      9.28      3.78
  TCP Send COPY ON Size 256           30.74     13.61      5.84
  TCP Send COPY ON Size 1024          68.89     30.38     13.80
  TCP Send COPY ON Size 4096         201.22     91.51     42.57
  UDP Send COPY ON Size 16            25.25      9.04      3.64
  UDP Send COPY ON Size 64            26.42      9.76      4.03
  UDP Send COPY ON Size 256           29.96     10.56      4.70
  UDP Send COPY ON Size 1024          45.63     22.39     10.89
  UDP Send COPY ON Size 4096         133.91     70.10     35.73
  COLD TCP Send Cksum Off            375.36    247.13    139.20
  COLD TCP Send Cksum Off COPY ON    549.63    357.47    198.68
  COLD UDP Send Cksum Off            123.81     83.35     47.39
  COLD UDP Send Cksum Off COPY ON    301.00    195.98    108.25

Table 4.11 Machine CPIs

  Protocol Configuration               1994      1996      1998
  TCP Send COPY OFF Size 4096          2.57      1.58      1.45
  UDP Send COPY OFF Size 4096          2.24      1.42      1.42
  TCP Send COPY ON Size 16             2.16      1.65      1.48
  TCP Send COPY ON Size 64             1.81      1.57      1.61
  TCP Send COPY ON Size 256            2.11      1.85      2.00
  TCP Send COPY ON Size 1024           2.70      2.37      2.70
  TCP Send COPY ON Size 4096           3.38      3.07      3.58
  UDP Send COPY ON Size 16             2.07      1.42      1.44
  UDP Send COPY ON Size 64             2.11      1.51      1.57
  UDP Send COPY ON Size 256            2.49      1.97      2.23
  UDP Send COPY ON Size 1024           2.88      2.82      3.46
  UDP Send COPY ON Size 4096           3.49      3.71      4.75
  COLD TCP Send Cksum Off             12.87     16.97     23.93
  COLD TCP Send Cksum Off COPY ON      9.28     12.09     16.82
  COLD UDP Send Cksum Off             16.72     22.62     32.28
  COLD UDP Send Cksum Off COPY ON      8.03     10.47     14.49

4.6 Improving I-Cache Performance with Cord

In this section we examine the flip side of hardware-software interaction: tuning or changing the software to take better advantage of the hardware. In this chapter, we have advocated techniques that improve instruction cache behavior. Mosberger et al. [83] and Blackwell [13] provide two examples of how this can be done. Mosberger et al. examine several compiler-related approaches to improving protocol latency. Using a combination of their techniques (outlining, cloning, and path-inlining), they show up to a 40 percent reduction in protocol processing times. Blackwell [13] also identifies instruction cache behavior as an important performance factor using traces of NetBSD. He proposes a technique for improving processing times for small messages, processing batches of packets at each layer so as to maximize instruction cache reuse, and evaluates this technique via a simulation model of protocol processing.

In this section we evaluate another technique: improving instruction cache behavior using CORD [113]. CORD is a binary re-writing tool that uses profile-guided code positioning [96] to reorganize executables for better instruction cache behavior. An original executable is run through Pixie [114] to determine its run-time behavior and profile which procedures are used most frequently. CORD uses this information to re-link the executable so that the most frequently used procedures are grouped together. This heuristic approach attempts to minimize the likelihood that "hot" procedures will conflict in the instruction cache.

We ran our suite of network protocol benchmarks through Pixie and CORD to produce CORDed equivalents of the executables. Table 4.12 presents the latencies of both the original and CORDed versions of the programs. As can be seen, the performance improvements range from 0 to 40 percent.

Table 4.12 Baseline & CORDed Protocol Latencies (µsec)

  Protocol Configuration    Original time    CORD time    Diff (%)
  TCP Send Cksum Off                76.61         72.61           5
  TCP Send Cksum On                147.80        148.38           0
  TCP Recv Cksum Off                58.04         54.33           6
  TCP Recv Cksum On                190.42        186.61           2
  UDP Send Cksum Off                20.92         12.53          40
  UDP Send Cksum On                 77.45         65.59          15
  UDP Recv Cksum Off                33.46         27.03          19
  UDP Recv Cksum On                160.15        148.76           7
Table 4.13 gives the cache miss rates for the CORDed benchmarks. Comparing these numbers with Table 4.3, we can see that the CORDed executables exhibit instruction cache miss rates that are 20 to 100 percent lower than those of the original executables.

Table 4.13 Cache Miss Rates for CORDed Protocols

  Protocol Configuration    Level 1 Instr    Level 1 Data    Level 2 Unified
  TCP Send Cksum Off                6.10%           6.00%              0.70%
  TCP Send Cksum On                 3.10%           7.60%              0.80%
  TCP Recv Cksum Off                4.70%           2.50%              1.40%
  TCP Recv Cksum On                 1.70%          15.70%              7.30%
  UDP Send Cksum Off                0.00%           0.00%             22.20%
  UDP Send Cksum On                 0.10%           1.20%              7.70%
  UDP Recv Cksum Off                0.60%           1.60%              9.20%
  UDP Recv Cksum On                 0.20%          17.80%             10.80%
In the case of the UDP send-side experiment without checksumming, we see that the rearranged executable achieves 100 percent hit rates in both the instruction and data caches! This shows how data references can be indirectly improved by changing instruction references: in this case, the changes removed a conflict in the L2 unified cache between instructions and data, and thereby eliminated invalidations of the L1 caches forced by the inclusion property.
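The essence of the code-positioning step is a greedy layout driven by the profile. The sketch below is our own simplified rendering of that idea, ordering procedures purely by invocation count; CORD's actual heuristic also takes call-graph relationships into account.

    #include <stdlib.h>

    /* One procedure as seen by the layout tool. */
    typedef struct {
        const char   *name;
        unsigned long count;   /* invocation count from the Pixie profile */
        unsigned long size;    /* code size in bytes */
    } proc_t;

    static int hotter_first(const void *a, const void *b)
    {
        const proc_t *pa = a, *pb = b;
        if (pa->count == pb->count) return 0;
        return (pa->count < pb->count) ? 1 : -1;
    }

    /* Assign new starting offsets so the hottest procedures are adjacent in
     * the text segment, reducing the chance that two hot procedures map to
     * the same instruction-cache lines. */
    void layout_procedures(proc_t *procs, size_t n, unsigned long *offsets)
    {
        unsigned long addr = 0;

        qsort(procs, n, sizeof *procs, hotter_first);
        for (size_t i = 0; i < n; i++) {
            offsets[i] = addr;
            addr += procs[i].size;
        }
    }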
4.7 Conclusions and Future Work

In this chapter we have examined the cache behavior of network protocols. We summarize our findings as follows:
Instruction cache behavior is significant. Despite previous work's emphasis on reducing data references (for example, in ILP), we find that instruction cache behavior has a larger impact on performance than data cache behavior in most scenarios. The exception is with protocol architectures that copy data; in this case the data cache behavior is more significant when packets are large.
Cold cache performance falls dramatically. In cases where caches are cold before packet processing, latencies are 3 times longer for UDP and 4 times longer for TCP.
Network protocols are well-suited to RISC architectures. We find that network protocols use a much smaller set of instructions (i.e., load, store, add, shift, and branch), compared to application code (as embodied by the SPEC95 benchmark suite). However, different protocols use these instructions in different quantities.
Associativity improves performance. We demonstrate that conflict misses occur in network protocols, most significantly in TCP, and that associativity can remove most of these misses.
Larger caches improve performance. Increasing the cache size reduces latency, with TCP being more sensitive to cache size than UDP.
Future architectures reduce the gap. Network protocols should scale well with clock speed on future machines, except for two important scenarios: protocol architectures that copy data, and when protocols execute out of a cold cache. Avoiding copies will only become more important for protocol performance on future architectures.
Code layout is effective for network protocols. Simple compiler-based tools such as CORD that perform profile-guided code positioning are effective on network protocol software, improving performance by up to 40 percent and reducing network protocol software's demands on the memory system.

These results indicate that instruction-cache-centric optimizations hold the most promise, even though larger primary caches with small associativities are becoming the norm. They also indicate that efforts to improve i-cache performance of complex protocols such as TCP are worthwhile. However, simpler protocols such as UDP and IP probably do not warrant the effort, in that small amounts of associativity and automated tools such as CORD are sufficient.
CHAPTER 5
SUMMARY AND FUTURE WORK

5.1 Summary of the Dissertation

In this section we summarize the research presented in this dissertation. In Chapter 1 we discussed the rise of client/server computing, the inherent asymmetry in that model, and how this led to the need for high-performance networked information servers. We identified three research areas in the context of network support for high-performance servers and discussed the motivations for each of them. These areas are packet-level parallelism, support for secure servers, and cache behavior of network protocols. In Chapters 2, 3 and 4 we discussed each of these areas in detail.

In Chapter 2 we outlined different approaches for using parallelism in network protocol processing. We identified significant performance issues in network protocol processing on shared-memory multiprocessors when packets are used as the unit of concurrency, i.e., when a thread is assigned to each packet. We presented a packet-level parallel implementation of a core TCP/IP protocol stack, described our experimental environment, and discussed the protocols used in our studies. We evaluated the available parallelism for packet-level parallelism, showing that it varies depending on the protocols and the number of connections used. Our results showed good available parallelism for connectionless protocols such as UDP, but limited speedup using TCP within a single connection. However, we found that using multiple connections improves available parallelism. We found that packet ordering plays a key role in determining single-connection TCP performance. We showed how locking structure impacts performance, and that a complex protocol with large connection state yields better speedup with a single lock than with multiple locks. We found that exploiting cache affinity and avoiding contention is significant.

In Chapter 3 we showed how parallelism can be used to improve cryptographic protocol performance. We discussed issues in the software implementation of security protocols, described the protocols that we studied, and briefly outlined the parallel implementations that we used. We evaluated available parallelism for several different cryptographic protocols using both packet-level and connection-level parallelism. Both were found to demonstrate excellent available parallelism, showing linear speedup with several Internet-based cryptographic protocol stacks.

In Chapter 4 we presented a performance study of memory reference behavior in network protocol processing, using an execution-driven simulator. We presented our architectural simulator, and quantified the accuracy of the simulator through validation. We characterized the cache behavior of a uniprocessor protocol stack, determining statistics such as cache miss rates and the percentage of time spent waiting for memory. We evaluated the sensitivity of network protocols to the host architecture, varying factors such as cache size and associativity. We showed that network protocol cache behavior varies widely, with miss rates ranging from 0 to 28 percent, depending on the scenario. We found that instruction cache behavior has the primary effect on protocol latency in most cases, and that cold cache behavior is very different from warm cache behavior. We demonstrated the upper bounds on performance that can be expected by improving memory behavior, and the impact of features such as associativity and larger cache sizes. We showed that TCP is more sensitive to cache behavior than UDP, gaining larger benefits from improved associativity and bigger caches. We found that network protocols are well-suited to RISC architectures, and that network protocols should scale well with CPU speeds in the future.
5.2 Suggestions for Future Work

One of the more straightforward directions for future work would be to examine aspects of network protocols other than data transfer. Given the rise of connection-oriented services such as the World-Wide Web and continuous media, connection management and its attendant overhead require investigation. Similarly, other styles of protocols, most notably remote procedure call (RPC), would be interesting to explore.

Another direction for the future would be to extend our work in support for secure servers. We have discussed more fine-grained approaches to parallelism to support high-performance cryptography, for example decrypting DES packets in parallel. Designing, implementing, and evaluating such an approach remains to be done, to see what sort of parallelism is actually achievable in practice.

A third direction for further work is to examine in more detail the interaction of network protocol software and computer architecture. There are several important factors in modern computer architecture that we have not yet investigated. Multiple instruction issue, non-blocking caches, and speculative execution are all emerging in the latest generations of microprocessors. Evaluating network protocol processing in the presence of these architectural features remains to be done. Another aspect is the impact of instruction set complexity on network protocol performance. Our results have been obtained on a typical RISC microprocessor. Given the widespread commercial adoption of the Intel x86 architecture, a CISC instruction set, it would be interesting to examine cache behavior and instruction set usage on this platform. We speculate that, given the more compact instruction representation on CISC machines, the data cache will play a more significant role. Finally, extending the cache simulator to completely model the multiprocessor features of our SGI Challenge platform would allow us to study in fine-grained detail the memory reference patterns of network protocols on shared-memory multiprocessors.
APPENDIX
VALIDATING AN ARCHITECTURAL SIMULATOR

In this appendix, we report on our experiences in constructing the architectural simulator used in Chapter 4. We describe our simulator, enumerate our assumptions, show our validation approach, and demonstrate accuracy results averaging under 5 percent. We conclude with the lessons learned in validating an architectural simulator.
A.1 Introduction

In this appendix, we report on our experiences in building an architectural simulator that is meant to accurately capture the performance costs of a machine for a particular class of software, namely, network protocol stacks such as TCP/IP. The simulator models a single processor of our Silicon Graphics Challenge shared-memory multiprocessor, which has 100 MHz MIPS R4400 chips and two levels of cache memory. The purpose of this simulator is to understand the performance costs of a network protocol stack executing in user space on our SGI machine, and to guide us in identifying and reducing bottlenecks, as detailed in Chapter 4.

The primary goal of this simulator has been to accurately model performance costs for our SGI machine. Much of the simulation literature discusses the tradeoff between speed and accuracy, and describes techniques for making simulations fast. However, accuracy is rarely discussed (notable exceptions include [10, 25, 36]), and the tradeoff between accuracy and speed has not been quantitatively evaluated. Given that our simulator is meant to capture performance costs, it must be more than an emulator that duplicates the execution semantics of the hardware or counts events such as
cache misses. The simulator should perform similarly to our actual hardware; i.e., an application taking N time units on the real hardware should take N simulated time units on the simulator. We quantify accuracy in terms of how closely an application's performance on the simulator comes to matching its performance on the real hardware. Our simulator is designed for a specific class of software, namely, computer network communication protocols such as TCP/IP, and we use these protocols to evaluate the simulator's accuracy. Network protocols most closely resemble integer benchmarks, and use, essentially, load, store, control, and simple arithmetic instructions. Our protocol benchmarks have cache hit rates ranging from 75 to 100 percent, and spend between 15 and 75 percent of their time waiting for memory. We also evaluate accuracy on memory-intensive microbenchmarks from LMBench [78]. We have not evaluated accuracy on numeric (i.e., floating-point) benchmarks such as the SPEC 95 FP suite. On the workloads that we have tested, we find that the simulator predicts latencies that are, on average, within 5 percent of the actual measured latencies.
A.2 Architectural Simulator

Our architectural simulator is built using MINT [122], a toolkit for implementing multiprocessor memory reference simulators. MINT interprets a compiled binary directly and executes it, albeit much more slowly than if the binary were run on the native machine. This process is called direct execution. MINT is designed for use with MIPS-based multiprocessors, such as our SGI machines, and has support for the multiprocessor features of IRIX. Unlike several other simulation packages, it requires only the binary executable of the program. (MINT does not yet support dynamic linking; thus, the binary must be statically linked.) As a consequence, the application source does not need to be available, and the application does not need to be modified for use in the simulator. This means that exactly the same binary is used on both the actual machine and in the simulator.

A simulator built using MINT consists of two components: the front end, provided by MINT, which handles the interpretation and execution of the binary, and a back end, supplied by the user, which keeps track of cache state and provides the timing properties that are used to emulate a target architecture. The front end is usually called a trace generator, and the back end a trace consumer. On each memory reference, the front end invokes the back end, passing the appropriate memory address. Based on its internal state, the back end returns a value to the front end telling it whether to continue (for example, on a cache hit) or to stall (on a cache miss).

We have designed and implemented a back end for use with MINT to construct a uniprocessor simulator for our 100 MHz R4400-based SGI Challenge. The primary goal of the simulator has been performance accuracy; thus, we have attempted to ensure the timing accuracy of the simulator. Accuracy generally depends on a number of issues:
- The accuracy of the assumptions made by the simulator,
- the accuracy to which instruction costs (e.g., branches and adds) are modeled, and
- the accuracy to which memory references (e.g., cache hits) are modeled.

We describe our approach to each of these issues in turn.
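To make the division of labor between the front end and the back end concrete, the following is a minimal sketch of the kind of per-reference hook described above, using a pair of direct-mapped tag arrays. The function name, structure, and line size are illustrative assumptions made for this sketch, not MINT's actual interface; the cycle costs are the measured read costs reported later in Table A.3.

/*
 * Sketch of a MINT-style trace consumer (back end).  The front end calls
 * sim_reference() on every load or store with the effective address; the
 * return value is the number of cycles to stall the simulated thread.
 * Names and constants are illustrative assumptions, not MINT's real API.
 */
#include <stdint.h>

#define LINE_SHIFT  5                              /* assume 32-byte lines    */
#define L1_SETS     ((16u * 1024) >> LINE_SHIFT)   /* 16 KB direct-mapped L1  */
#define L2_SETS     ((1024u * 1024) >> LINE_SHIFT) /* 1 MB direct-mapped L2   */

#define L1_HIT_CYCLES  0      /* measured read costs (see Table A.3) */
#define L2_HIT_CYCLES  11
#define BUS_CYCLES     141

static uint32_t l1_tag[L1_SETS];   /* direct-mapped tag arrays */
static uint32_t l2_tag[L2_SETS];

/* Called by the front end on each memory reference; returns stall cycles. */
int sim_reference(uint32_t addr)
{
    uint32_t line = addr >> LINE_SHIFT;

    if (l1_tag[line % L1_SETS] == line)
        return L1_HIT_CYCLES;            /* continue: first-level hit      */
    l1_tag[line % L1_SETS] = line;       /* fill L1 on the way back        */

    if (l2_tag[line % L2_SETS] == line)
        return L2_HIT_CYCLES;            /* stall: second-level hit        */
    l2_tag[line % L2_SETS] = line;       /* miss: fetch over Challenge bus */
    return BUS_CYCLES;
}

A real back end additionally distinguishes reads from writes, maintains the inclusion property between L1 and L2, and records the statistics that appear in the sample output of Section A.4.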
A.2.1 Assumptions

Several assumptions are made, most of which are intrinsic to MINT:
- We ignore context switches, TLB misses, and potentially conflicting operating system tasks; the simulation essentially assumes a dedicated machine. This assumption is required by MINT and most other simulators.
- Unless otherwise specified, all instructions and system calls take one cycle. Certain exceptional library calls, such as malloc(), also take one cycle. This is described in more detail below. This assumption is required by MINT and most other simulators.
- We assume that the heap allocator used by MINT is the same as that used by the IRIX C library, or that any differences between the two allocators do not impact accuracy. This assumption is required by MINT.
- We assume that virtual addresses are the same as physical addresses. For virtually-indexed caches such as those on-chip in the R4400, this is accurate. However, in our SGI systems, the second-level cache is physically indexed. The mapping between virtual and physical addresses on the real system is determined by the IRIX operating system. IRIX 5.3 is reported to use page coloring [24, 67] as its virtual-to-physical mapping strategy. In page coloring, whenever a new virtual-to-physical mapping is created, the OS attempts to assign a free physical page such that both the virtual and physical addresses map to the same bin in a physically indexed cache. This way, pages that are adjacent in virtual memory will be adjacent in the cache as well. However, if a free physical page that meets this criterion is not available, another free page will be chosen. Thus, our assumption matches the behavior of the operating system if the OS is successful in finding matching pages. Unfortunately, we are not aware of any way to determine the virtual-to-physical mappings and thus see how well the OS is finding matched pages. Since the OS may choose a non-matching page, the virtual address and the physical address may map to different bins in the L2 cache. Thus, an application executing in the simulator may experience conflicts between two lines in L2 that would not occur in the real system, and vice versa. The impact on accuracy may be exacerbated by the inclusion property [7], which requires that, for coherency reasons, all lines cached in L1 must be held in L2. If a (possibly erroneous) L2 conflict forces a line to be removed, it must be invalidated in L1 as well.

A.2.2 Modeling Instruction Costs

One of the major assumptions in MINT is that all instructions and replaced functions (such as malloc() and uspsema()) execute in one cycle. For most instructions on RISC architectures, this is a reasonable assumption. For a few instructions, however, this assumption is incorrect. Integer divides, for example, take 75 cycles. However, MINT allows the user to change the times associated with each instruction via a supplied file that lists instructions and functions along with their simulated costs. To see which instructions are actually being used by our applications, we developed a MINT back-end tool that counts dynamic instruction use and prints out a histogram. Table A.1 gives an example of the instruction frequencies for a program.

Table A.1: Instruction Frequencies

Instr      Num        %
mul.d          1     0.01
bgez         109     0.70
or           152     0.97
andi         180     1.15
beq          367     2.34
slti        1039     6.64
addu        1112     7.10
bne         1219     7.79
addiu       1610    10.28
st          2284    14.59
nop         3560    22.74
lw          3619    23.12
Total      15656   100.00

This allows us to focus our attention on making sure that the time values for these instructions are correct. In the above example, it is much more important that the time for a branch (bne, beq) be correct than that for a floating-point multiply (mul.d), since branches occur orders of magnitude more frequently. We then wrote micro-benchmarks stressing the use of those instructions and functions in order to measure their cost on our system. We also wrote a similar tool to count the use of replaced functions and to count the data sizes (e.g., byte, word) of loads and stores.

Unfortunately, only a small subset of timing values are available in the MIPS R4000 Microprocessor User's Manual [50]. For instructions not listed there, we needed to construct micro-benchmarks to determine their cycle times. Table A.2 presents the cycle times of the instructions, functions, and system calls used that do not follow the single-cycle instruction assumption. These values are fed into MINT via the cycle file. The list in Table A.2 is far from complete; these happen to be the instructions and functions that our benchmarks use. MINT's original notion of time was solely in cycles. We convert this notion into real time by setting a cycle time in nanoseconds.

Table A.2: Instructions that take more than 1 cycle

Instruction or Function    Time in Cycles
div                                    75
divu                                   75
mult                                    2
multu                                   2
uspsema                               113
usvsema                               113
malloc                                 70
free                                   70
sginap                               6000
gettimeofday                         1580
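The instruction-counting tool described in Section A.2.2 can be quite simple. The following sketch shows the idea under the assumption of a hypothetical per-instruction hook and a small opcode table; it is not the actual tool.

/*
 * Sketch of a dynamic instruction-frequency counter.  A (hypothetical)
 * per-instruction hook bumps a counter for each decoded opcode, and a
 * histogram like Table A.1 is printed when the run finishes.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_OPCODES 256

static uint64_t    count[MAX_OPCODES];
static const char *name[MAX_OPCODES];

/* Called once per dynamic (executed) instruction. */
void count_instruction(int opcode, const char *mnemonic)
{
    if (opcode >= 0 && opcode < MAX_OPCODES) {
        count[opcode]++;
        name[opcode] = mnemonic;
    }
}

/* Called at exit: print mnemonic, dynamic count, and percentage of total. */
void print_histogram(void)
{
    uint64_t total = 0;
    int i;

    for (i = 0; i < MAX_OPCODES; i++)
        total += count[i];
    if (total == 0)
        return;

    for (i = 0; i < MAX_OPCODES; i++)
        if (count[i] > 0)
            printf("%-8s %10llu %6.2f\n", name[i],
                   (unsigned long long)count[i],
                   100.0 * (double)count[i] / (double)total);
    printf("%-8s %10llu %6.2f\n", "Total",
           (unsigned long long)total, 100.0);
}

Sorting such a histogram by count immediately shows which instruction times matter most for accuracy.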
A.2.3 Pipelining in the R4000

One consequence of MINT's single-cycle assumption is that the R4000 pipeline is not modeled precisely. The single-cycle assumption implies that the pipeline never stalls,
which is not true under certain conditions. For example, if an instruction loads a value from memory into a register and the subsequent instruction uses that register, the latter instruction will stall. This stalling is called a pipeline interlock [51]. MINT does not yet model pipeline interlocks. To model load-delay pipeline interlocks accurately, we instead keep track of loads in the back end. In the R4000, loads have a two-cycle load delay. This means we need to keep track of at most two registers, since only one load can be issued each cycle, and we need only track a loaded register for two cycles. Whenever a register is loaded, we record it and allow it to age over time. On each instruction, we look at the registers used and compare them with those recently loaded. If any of the registers are identified as not having completed their load delay, the instruction stalls one or two cycles as appropriate.

Branches and jumps can also cause stalls in the MIPS pipeline. Jumps, or unconditional branches, take 4 cycles. Branches are more complex than jumps in that they take 4 cycles if the branching condition is true, and 2 cycles otherwise. In either case, one cycle is exposed to the compiler as a branch delay slot, which the compiler will attempt to fill with a useful instruction if possible, or a NOP otherwise. MINT will charge a cost for the instruction in this visible delay slot. This implies that we should set the cost value for a jump at 3 cycles, and the cost for branch instructions to 3 or 1 cycles depending on whether the branch is taken. However, MINT cannot distinguish between these cases, and we are only allowed to set a single value for a branch instruction. To model the dynamic costs of branches accurately, the back end keeps track of the program counter on every instruction and data reference. When the PC does not change to the next sequential instruction (i.e., changes by anything other than 4), the back end delays by 2 cycles to emulate the branch delay cost. This accurately covers the cost of both jump instructions and all taken branches, so that we do not need to change their cycle time values.
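A simplified sketch of this bookkeeping follows. The function signature and the way the decoder reports source and destination registers are assumptions made for the sketch; the real back end is woven into the per-instruction callbacks and models the corner cases more carefully.

/*
 * Sketch of load-delay and branch-delay tracking.  We remember the
 * destination registers of the last two loads and how long ago they issued,
 * stall an instruction that reads a register whose two-cycle load delay has
 * not completed, and charge two extra cycles whenever the PC does not
 * advance to the next sequential instruction (a taken branch or a jump).
 */
#include <stdint.h>

#define LOAD_DELAY 2                     /* R4000 loads have a 2-cycle delay */

struct pending_load { int reg; int age; };
static struct pending_load pending[2];   /* at most one load per cycle */
static uint32_t last_pc;

/* Called once per simulated instruction; returns extra stall cycles.
 * Register number 0 ($zero) is used here to mean "no register". */
int pipeline_stall(uint32_t pc, int src1, int src2, int load_dest)
{
    int i, stall = 0;

    /* Check the source registers against outstanding loads, then age them. */
    for (i = 0; i < 2; i++) {
        if (pending[i].reg != 0 && pending[i].age < LOAD_DELAY) {
            if (src1 == pending[i].reg || src2 == pending[i].reg) {
                int s = LOAD_DELAY - pending[i].age;   /* one or two cycles */
                if (s > stall)
                    stall = s;
            }
            pending[i].age++;
        }
    }

    /* Non-sequential PC: emulate the branch/jump delay cost. */
    if (last_pc != 0 && pc != last_pc + 4)
        stall += 2;
    last_pc = pc;

    /* Record this instruction's load, if any. */
    if (load_dest != 0) {
        pending[0] = pending[1];
        pending[1].reg = load_dest;
        pending[1].age = 0;
    }
    return stall;
}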
A.2.4 Modeling Memory References

We used the memory striding benchmarks from LMBench [78] to measure the cache hit and miss latencies for all three levels of the memory hierarchy: L1, L2, and main memory. These in turn gave us values with which to parameterize the back-end cache simulator. Table A.3 lists the cycle times to read and write the caches on the 100 MHz SGI Challenge (IP19).

Table A.3: Read and write times in cycles

Layer in Hierarchy    Read time    Write time
L1 Cache                      0             0
L2 Cache                     11            11
Challenge Bus               141           147

The same memory stride programs were then run in the simulator, using the table of modified instruction times, to ensure that the simulated numbers agreed with those from the real system. Figure A.1 shows the LMBench read memory access time as a function of the area walked by the stride benchmark, as run on our 100 MHz R4400 SGI Challenge. We call this graph a memory signature. The memory signature illustrates the access times of the first-level cache, the second-level cache, and main memory. When the area walked by the benchmark fits within the first-level cache (i.e., is 16 KB or less), reading a byte in the area results in a first-level cache hit and takes 20 nanoseconds. When the area fits within the second-level cache (i.e., is between 16 KB and 1 MB in size), reading a byte results in a second-level cache hit and takes 134 nanoseconds. If the area is larger than 1 MB, main memory is accessed, and the time to read a byte is 1440 nanoseconds. Note that the scales in Figure A.1 are logarithmic in both the x and y axes. Figure A.2 shows the memory signature of the same binary being run on the simulator for the same machine.

[Figure A.1: Actual Read Times. Predator (SGI IP19) read latency in nanoseconds versus array size in KB, for strides of 4 to 128 bytes, on the 100 MHz R4400 (10 ns clock) with a 16 KB split I/D first-level cache and a 1 MB unified, direct-mapped, write-back second-level cache. Both axes are logarithmic.]

[Figure A.2: Simulated Read Times. The same memory signature produced by the simulator for the same configuration.]

Table A.4 lists the numbers from Figures A.1 and A.2 in textual form for comparison. As can be seen, the simulator models the cache memory behavior very closely.

Table A.4: LMBench real and simulated read times in nanoseconds

Layer in Hierarchy    Real time    Simulated time    Diff (%)
L1 Cache                     20                20           0
L2 Cache                    131               134           2
Challenge Bus              1440              1440           0
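The striding measurement works by building a dependent chain of pointers through an area of a given size and timing how long each load in the chain takes, so the average latency reflects the level of the hierarchy that the area fits in. The following is a much-simplified sketch of such a benchmark written for this appendix; it is not the LMBench source, and absolute numbers from it will differ from LMBench's more careful measurement loop.

/*
 * Sketch of a memory "signature" microbenchmark: for each area size, link
 * the area into a pointer chain with a fixed stride, chase the chain, and
 * report the average nanoseconds per dependent read.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ACCESSES 10000000L

static double read_latency_ns(size_t size, size_t stride)
{
    char *buf = malloc(size);
    char **p;
    size_t i;
    long n;
    struct timeval t0, t1;

    /* Link element i to element i + stride, wrapping back to the start. */
    for (i = 0; i + stride < size; i += stride)
        *(char **)&buf[i] = &buf[i + stride];
    *(char **)&buf[i] = &buf[0];

    p = (char **)buf;
    gettimeofday(&t0, NULL);
    for (n = 0; n < ACCESSES; n++)
        p = (char **)*p;                 /* each load depends on the last */
    gettimeofday(&t1, NULL);

    if (p == NULL)                       /* keep the chain from being optimized away */
        printf("unreachable\n");

    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_usec - t0.tv_usec) * 1e3) / (double)ACCESSES;
}

int main(void)
{
    size_t kb;
    for (kb = 1; kb <= 4096; kb *= 2)    /* 1 KB through 4 MB */
        printf("%5lu KB: %6.1f ns per read\n",
               (unsigned long)kb, read_latency_ns(kb * 1024, 128));
    return 0;
}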
A.2.5 Summary of Validation Sequence

In general, we use a sequence of actions to validate a particular application. This sequence can be repeated several times depending on the application. As each new application is introduced:
- It is run through the instruction-frequency counting tool to instrument its instruction usage.
- If any previously unexamined instructions are used in a significant fashion, appropriate micro-benchmarks for those instructions are produced and their timings ascertained.
- The table of instruction times is updated to reflect the newly determined values.
- The stride benchmarks are re-run inside the simulator to ensure that the memory latencies are still accurate.
- Finally, the application is run on the cache simulator.

A.3 Validation Results

Table A.5 lists the current set of benchmarks, with their corresponding real and simulated latencies in microseconds, and the relative error.

Table A.5: Macro-benchmark times (μsec) and relative error

Benchmark               Simulated      Real    Error (%)
TCP Send, Cksum OFF         76.63     78.58         2.48
TCP Send, Cksum ON         147.84    146.66        -0.81
UDP Send, Cksum OFF         18.43     15.97       -15.40
UDP Send, Cksum ON          71.99     70.30        -2.41
TCP Recv, Cksum OFF         58.06     62.65         7.33
TCP Recv, Cksum ON         190.47    198.39         3.99
UDP Recv, Cksum OFF         33.80     32.84        -2.95
UDP Recv, Cksum ON         161.78    158.84        -1.85
Average Error                                       4.65

Error is defined as
Error = \frac{Simulated\;Value - Real\;Value}{Real\;Value} \times 100

A negative error means the simulator underestimates the real time; a positive value means it overestimates the real time. The average error is calculated as the mean of the absolute values of the individual errors; this prevents positive and negative individual values from canceling each other out. Note that the average error is under 5 percent, with the worst-case error being about 15 percent.

Table A.6 presents more accuracy results, this time modifying the protocol benchmarks by adding copies (COPY ON) or by executing the CORDed versions of the executables. CORD [113] is a binary rewriting tool that uses profile-guided code positioning [96] to reorganize executables for better instruction cache behavior. An original executable is run through Pixie [114] to determine its runtime behavior and profile which procedures are used most frequently. CORD uses this information to re-link the executable so that the procedures used most frequently are grouped together. This heuristic approach is meant to minimize the likelihood that "hot" procedures will conflict in the caches, both in the L1 instruction cache and in the L2 unified cache.

Table A.6: Macro-benchmark times (μsec) and relative error

Benchmark                              Simulated      Real    Error (%)
TCP Send, COPY ON, Cksum OFF              201.27    200.47        -0.40
TCP Send, COPY ON, Cksum ON               268.55    264.58        -1.50
UDP Send, COPY ON, Cksum OFF              131.83    126.19        -4.47
UDP Send, COPY ON, Cksum ON               184.55    185.39         0.45
TCP Recv, COPY ON, Cksum OFF              248.10    258.06         3.86
TCP Recv, COPY ON, Cksum ON               313.47    327.21         4.20
UDP Recv, COPY ON, Cksum OFF              218.35    217.68        -0.31
UDP Recv, COPY ON, Cksum ON               278.24    267.21        -4.13
CORD TCP Send, Cksum OFF                   72.64     68.72        -5.70
CORD TCP Send, Cksum ON                   148.42    144.76        -2.53
CORD UDP Send, Cksum OFF                   12.54     13.49         7.10
CORD UDP Send, Cksum ON                    66.05     65.32        -1.12
CORD TCP Send, COPY ON, Cksum OFF         197.10    190.59        -3.42
CORD TCP Send, COPY ON, Cksum ON          268.97    251.41        -6.98
CORD UDP Send, COPY ON, Cksum OFF         126.02    127.36         1.05
CORD UDP Send, COPY ON, Cksum ON          178.54    176.03        -1.43
One side effect of this is that CORDed executables tend to have better accuracy than regular executables. Since one of the simulator's assumptions is that virtual addresses are the same as physical addresses, which is not true for the L2 cache, part of the simulator's accuracy depends on modeling conflicts in L2 correctly. The larger the number of conflicts, the more likely the simulator will not capture their cost accurately. Similarly, the fewer the conflicts, the less impact they have on performance, and the less impact the virtual = physical assumption has on accuracy. For example, one protocol benchmark, the UDP Send without checksumming, has the worst accuracy on the simulator, with an error of 15 percent. In this benchmark, all of the L1 data cache misses are caused by evictions that are the result of a conflict in the L2 cache (the numbers are given in the sample output in Section A.4; note that the number of evictions in the L1 data cache is essentially the same as the number of misses). We believe this is the main cause of inaccuracy in this benchmark. The CORDed version of this executable does not exhibit this behavior, and its 7 percent error is half that of the regular executable. Due to time constraints, we could not run every permutation of every benchmark. However, given the range of cache hit rates and instructions that have been exercised by the simulator, we feel confident that it is very accurate for this class of applications.

A.4 Sample Output

Here we present a sample of the output from the simulator, in this case from the send-side UDP experiment with checksumming disabled. The sample gives an idea of the information captured by the simulator.

[L1 I Cache] Stats: ( 542598 invalidates, 271299 evicts, 439325653 cycles)
Operation:   Number       Hits         ( %   )   Misses    ( %  )   Cycles      ( %  )
READ:        196159176    188291498    ( 95.99)  7867678   ( 4.01)  0           ( 0.00)
WRITE:       0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
READ_EX:     0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
TOTAL:       196159176    188291498    ( 95.99)  7867678   ( 4.01)  0           ( 0.00)

[L1 D Cache] Stats: ( 542598 invalidates, 271299 evicts, 439325653 cycles)
Operation:   Number       Hits         ( %   )   Misses    ( %  )   Cycles      ( %  )
READ:        49920127     49648826     ( 99.46)  271301    ( 0.54)  0           ( 0.00)
WRITE:       36084989     36084989     (100.00)  0         ( 0.00)  0           ( 0.00)
READ_EX:     0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
TOTAL:       86005116     85733815     ( 99.68)  271301    ( 0.32)  0           ( 0.00)

[L2 U Cache] Stats: ( 0 invalidates, 0 evicts, 439325653 cycles)
Operation:   Number       Hits         ( %   )   Misses    ( %  )   Cycles      ( %  )
READ:        8138979      7596379      ( 93.33)  542600    ( 6.67)  83560169    (19.02)
WRITE:       0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
READ_EX:     0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
TOTAL:       8138979      7596379      ( 93.33)  542600    ( 6.67)  83560169    (19.02)

[IP19 Bus] Stats: ( 0 invalidates, 0 evicts, 439325653 cycles)
Operation:   Number       Hits         ( %   )   Misses    ( %  )   Cycles      ( %  )
READ:        542600       542600       (100.00)  0         ( 0.00)  76506600    (17.41)
WRITE:       0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
READ_EX:     0            0            (  0.00)  0         ( 0.00)  0           ( 0.00)
TOTAL:       542600       542600       (100.00)  0         ( 0.00)  76506600    (17.41)

Instruction      Number       ( %   )   Cycles       ( %   )
loads:           49920127     ( 25.45)  49920127     ( 11.36)
load stalls:     28215096     ( 56.52)  33912375     (  7.72)
stores:          35542391     ( 18.12)  35542391     (  8.09)
branches:        16007752     (  8.16)  37711672     (  8.58)
jumps:           13564950     (  6.92)  40694850     (  9.26)
control:         29572702     ( 15.08)  78406522     ( 17.85)
adds:            32014393     ( 16.32)  32014393     (  7.29)
subtracts:       2984289      (  1.52)  2984289      (  0.68)
muls:            271299       (  0.14)  542598       (  0.12)
divs:            1111         (  0.00)  83325        (  0.02)
shifts:          4069485      (  2.07)  4069485      (  0.93)
logicals:        12751053     (  6.50)  12751053     (  2.90)
sets:            5968578      (  3.04)  5968578      (  1.36)
immediates:      1899093      (  0.97)  1899093      (  0.43)

loads:           49920127     ( 25.45)  83832502     ( 19.08)
stores:          35542391     ( 18.12)  35542391     (  8.09)
control:         29572702     ( 15.08)  78406522     ( 17.85)
ariths:          59959301     ( 30.57)  60312814     ( 13.73)
nops:            20891134     ( 10.65)  20891134     (  4.76)

mem:             0            (  0.00)  160066769    ( 36.43)
cpu:             0            (  0.00)  279258884    ( 63.57)
total:           196159176    (100.00)  439325653    (100.00)
As can be observed, the simulator lists both how many times an event happened and what percentage of the total time the event contributed. For example, we see that control operations make up 15 percent of the instruction usage but contribute 17.85 percent of the total cycles. On average, a branch takes roughly 2.7 cycles. This tells us that even if branches cost a single cycle, our application performance would only improve by about 10 percent.
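As a back-of-the-envelope check of that estimate, using the counts from the sample output above (this arithmetic is ours, not part of the original output):

\[
\frac{78406522}{29572702} \approx 2.65 \text{ cycles per control operation},
\qquad
\frac{78406522 - 29572702}{439325653} \approx 0.11 ,
\]

that is, collapsing every control operation to one cycle would remove roughly 11 percent of the total cycles, consistent with the figure above.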
A.5 Improving Accuracy

There are several possibilities for improving the accuracy of the simulator:
- Modeling the pipeline more accurately. This would improve accuracy on more complex events, particularly combinations of instructions that affect the pipeline. The interaction between the pipeline and instructions with overlapped execution is not captured fully. For example, a multiply instruction placed in a branch delay slot is probably not modeled precisely in terms of how much (or how little) the pipe is stalled. However, adding this amount of detail requires a large amount of work in either the MINT front end or the back-end simulator. It is not clear what the overall impact on accuracy would be, or whether the amount of work would be worth the effort.
- Virtual addresses vs. physical addresses. We can examine more carefully the assumption that physical addresses are the same as virtual addresses. However, this assumption is usually correct in kernel code. In addition, without access to more information about IRIX's policies for assigning virtual pages to physical ones, we cannot know whether we are improving accuracy.
- Improving L2 write accuracy. At the moment, L2 write costs are overestimated by the simulator by about 30 percent. This appears to be a consequence of the presence of a write buffer and the lack of modeling of pipeline effects. This might be fixable; however, write misses that hit in L2 are extremely rare, so improving this would probably have little impact on overall accuracy.
What is interesting is how well the machine is being modeled despite many features not being represented. For example, the TLB, store buffer, and write buffer are all ignored, with apparently little impact on overall accuracy. Our belief is that improving accuracy will only become necessary when introducing new classes of software, most obviously floating-point benchmarks. Modeling pipeline slips and stalls will become more important for multiple-cycle instructions such as multiplies and divides.
A.6 Lessons Learned

Several general lessons were learned (or re-learned) over the course of constructing and validating the simulator. Some of these are general software engineering principles, but most relate to accuracy and validation.
Frequency is key. The frequency of events plays a key role in the overall accuracy. This can be thought of as the 90/10 rule, or the RISC approach to validation. One way to think of validation is as trying to minimize error, where error is defined as follows:
Error = \sum_{i=1}^{n} freq(E_i) \, (real(E_i) - sim(E_i))

Here, E_i is event i in the system, freq(E_i) is the frequency of event i, real(E_i) is the real cost of event i, and sim(E_i) is the simulated cost of event i. Examples of events include uses of a particular instruction, cache misses, or taken branches.
Obviously, the frequency of a particular event is application-dependent. To know the frequency of various events, one must be aware of what the application is doing, i.e., its dynamic behavior at run time. As mentioned before, modeling events that happen frequently is crucial. Similarly, events that happen occasionally can be modeled less carefully; however, their cost must be taken into account. For example, L2 write misses are very rare, but their cost is so high (146 cycles) that they must be accounted for. Events that never happen can be ignored, or assigned simple costs. For example, our simulator completely neglects floating-point costs, but given that not a single floating-point instruction is executed, this neglect does not impact our accuracy. It does, however, save us time: it eliminates the need to write micro-benchmarks to measure event costs, obviates the requirement to implement the appropriate functionality in the simulator, and improves the simulator's performance by not wasting cycles testing for events that never happen. The disadvantage of this is that the simulator is effectively tuned to the application, in that different applications can have different frequencies of events. However, the simulator can be tuned for a new application as necessary.
The law of diminishing returns applies. As a corollary of the frequency lesson, accuracy tends to get harder and harder to achieve over time. For example, adding the functionality to model the load-delay pipeline took a couple of days to design, implement, test, and debug. Adding this feature changed the accuracy of some of the checksummed protocol benchmarks from 20 percent to 5 percent. However, it only improved the average accuracy of the whole suite of benchmarks by one percent, and slowed down the simulator by a factor of 2-3.
Use tools. Events that never happen can be ignored; however, one must be certain that they never happen! Writing special-purpose tools to determine the frequency of cases is extremely useful.
Use microbenchmarks. Microbenchmarks that provoke certain types of behavior both allow the unit cost of those events to be measured and give a means to test the accuracy of the simulator on those events. This makes the process of validating the simulator essentially self-correcting. Of course, one should not bother writing micro-benchmarks for events that never happen.
Build in lots of self-checking code. The simulator has a large number of assertion checks and tracing print statements that are guarded by #ifdef's. These checks are normally compiled away for speed. However, turning them on explicitly tested all of our assumptions and verified that variables were in consistent states. In particular, whenever a major component was added to the simulator, we would first run with the full set of checks turned on to make sure that our assumptions had not been violated by the new functionality. Some of these assumptions were type issues that would be addressed by using a more type-safe language than C, such as Modula-3. However, most were assumptions about which state (out of several possible correct ones) a variable (such as a cache line) was in.
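The flavor of these compile-time-guarded checks is sketched below; the macro names and the cache-line example are illustrative assumptions, not the simulator's actual code.

/*
 * Sketch of #ifdef-guarded self-checking.  Built with -DSELF_CHECK the
 * assertions and traces are active; built without it they compile away.
 */
#include <stdio.h>
#include <stdlib.h>

#ifdef SELF_CHECK
#define CHECK(cond, msg)                                         \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "check failed: %s (%s:%d)\n",        \
                    (msg), __FILE__, __LINE__);                  \
            abort();                                             \
        }                                                        \
    } while (0)
#define TRACE(msg) fprintf(stderr, "trace: %s\n", (msg))
#else
#define CHECK(cond, msg) ((void)0)     /* compiled away for speed */
#define TRACE(msg)       ((void)0)
#endif

/* Example use in a hypothetical cache-line state transition. */
enum line_state { INVALID, SHARED, DIRTY };

void set_line_state(enum line_state *line, enum line_state next)
{
    CHECK(*line == INVALID || *line == SHARED || *line == DIRTY,
          "cache line in unknown state");
    TRACE("cache line state transition");
    *line = next;
}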
Automate, automate, automate. This greatly reduces the opportunities for error, and makes it easy to regenerate results when the simulator changes. We wrote many scripts and post-processing tools to do things such as calculate the relative error.
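One such post-processing tool, sketched here under an assumed input format of one benchmark per line, applies the error definition from Section A.3 and reports the mean of the absolute errors:

/*
 * Sketch of a relative-error post-processor.  Reads lines of the form
 *   <name> <simulated> <real>
 * prints the relative error for each benchmark, and finishes with the mean
 * of the absolute errors (the "average error" of Tables A.5 and A.6).
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    char   name[128];
    double sim, real, sum_abs = 0.0;
    int    n = 0;

    while (scanf("%127s %lf %lf", name, &sim, &real) == 3) {
        double err = (sim - real) / real * 100.0;
        printf("%-36s %9.2f %9.2f %8.2f\n", name, sim, real, err);
        sum_abs += fabs(err);
        n++;
    }
    if (n > 0)
        printf("Average (mean absolute) error: %.2f percent\n", sum_abs / n);
    return 0;
}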
Iterate until satisfied. An implementer can go through the cycle of adding functionality, re-executing benchmarks, and re-parameterizing the simulator with the new results until the accuracy is satisfactory. We iterated through the validation process roughly 15 times. The combination of these factors, along with the accuracy results, gives us great confidence in the resulting code.
A.7 Summary

This appendix has reported our experiences in building an execution-driven architectural simulator that is meant to accurately capture performance costs for a particular class of software, namely, network protocol stacks. The simulator models a single processor of our Silicon Graphics Challenge shared-memory multiprocessor, which has 100 MHz MIPS R4400 chips and two levels of cache memory. We have presented our approach to validation and shown average accuracy within 5 percent for our class of applications. We have described the lessons learned in validation, chief of which is that modeling frequent events accurately is key.
BIBLIOGRAPHY

[1] Allison, B. DEC 7000/10000 Model 600 AXP multiprocessor server. In Proceedings IEEE COMPCON, pages 456-464, San Francisco, CA, February 1993.
[2] American National Standards Institute (ANSI). American national standard data encryption standard. Technical report ANSI X3.92-1981, Dec. 1980.
[3] Anderson, T. E., Lazowska, E. D., and Levy, H. M. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631-1644, December 1989.
[4] Anderson, T. E., Levy, H. M., Bershad, B. N., and Lazowska, E. D. The interaction of architecture and operating system design. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 108-120, Santa Clara, CA, 1991.
[5] Asthana, A., Delph, C., Jagadish, H. V., and Kryzyzanowski, P. Towards a gigabit IP router. Journal of High Speed Networks, 1:218-288, 1992.
[6] Atkinson, R. Security architecture for the Internet Protocol. Request for Comments (Draft Standard) RFC 1825, Internet Engineering Task Force, Aug. 1995.
[7] Baer, J.-L. and Wang, W.-H. On the inclusion property for multi-level cache hierarchies. In Proceedings 15th International Symposium on Computer Architecture, pages 73-80, Honolulu, Hawaii, June 1988.
[8] Banks, D. and Prudence, M. A high-performance network architecture for a PA-RISC workstation. IEEE Journal on Selected Areas in Communications, 11(2):191-202, Feb. 1993.
[9] Barton, J. M. and Bitar, N. A scalable multi-discipline, multiple-processor scheduling framework for IRIX. In IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 24-40, Santa Barbara, CA, Apr. 1995.
[10] Bedichek, R. C. Talisman: Fast and accurate multicomputer simulation. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 14-24, Ottawa, Canada, May 1995.
[11] Bjorkman, M. The xx-Kernel: an execution environment for parallel execution of communication protocols. Dept. of Computer Science, Uppsala University, June 1993.
[12] Bjorkman, M. and Gunningberg, P. Locking effects in multiprocessor implementations of protocols. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 74-83, San Francisco, CA, Sept. 1993.
[13] Blackwell, T. Speeding up protocols for small messages. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, Stanford, CA, Aug. 1996.
[14] Blumrich, M. A., Dubnicki, C., Felton, E. W., Li, K., and Mesarina, M. R. Virtual-memory mapped interfaces. IEEE Micro, 15(1):21-28, Feb. 1995.
[15] Boden, N. J., Cohen, D., Felderman, R. E., Kulawik, A. E., Seitz, C. L., Seizovic, J. N., and Su, W.-K. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1):29-36, Feb. 1995.
[16] Borman, D. NTCP: A proposal for the next generation of TCP and UDP. In Submission to the End2End-Interest mailing list, pages 1-37, Eagan, MN, 1993. Cray Research. End2End archives available via FTP at ftp.isi.edu.
[17] Borman, D., Braden, R., and Jacobson, V. TCP extensions for high performance. Request for Comments (Proposed Standard) RFC 1323, Internet Engineering Task Force, May 1992.
[18] Boykin, J. and Langerman, A. The parallelization of Mach/4.3BSD: Design philosophy and performance analysis. In Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS I), pages 105-125, Ft. Lauderdale, FL, Oct. 1989.
[19] Braun, T. and Diot, C. Protocol implementation using integrated layer processing. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 151-161, Cambridge, MA, Aug. 1995.
[20] Braun, T. and Schmidt, C. Implementation of a parallel transport subsystem on a multiprocessor architecture. In 2nd International Symposium on High-Performance Distributed Computing, Spokane, Washington, July 1993.
[21] Braun, T. and Zitterbart, M. High performance internetworking protocol. In 15th IEEE Conference on Local Computer Networks, Minneapolis, Minnesota, Sept. 1990.
[22] Braun, T. and Zitterbart, M. Parallel transport system design. Fourth IFIP WG 6.4 Conference on High Performance Networking, pages 397-412, Dec. 1992.
[23] Brooks, F. P. The Mythical Man-Month: Essays on Software Engineering. Addison Wesley, Reading, Massachusetts, 1975.
[24] Bugnion, E., Anderson, J. M., Mowry, T. C., Rosenblum, M., and Lam, M. S. Compiler-directed page coloring for multiprocessors. In Proceedings of the Seventh International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), Cambridge, MA, October 1996.
[25] Calder, B., Grunwald, D., and Emer, J. A system level perspective on branch architecture performance. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture, pages 199-206, Ann Arbor, MI, November 1995.
[26] Cekleov, M. SPARCCenter 2000: Multiprocessing for the 90's! In Proceedings IEEE COMPCON, pages 345-353, San Francisco, CA, February 1993.
[27] Chen, J. B. and Bershad, B. N. The impact of operating system structure on memory system performance. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 120-133, Asheville, NC, 1993.
[28] Chu, H.-K. J. Zero copy TCP in Solaris. In Proceedings of the Winter USENIX Technical Conference, San Diego, CA, Jan. 1996.
[29] Clark, D. D. Modularity and efficiency in protocol implementation. Request for Comments RFC 817, Internet Engineering Task Force, July 1982.
[30] Clark, D. D. The structuring of systems using upcalls. In Proceedings of the Tenth ACM Symposium on Operating Systems Principles, pages 171-180, December 1985.
[31] Clark, D. D., Jacobson, V., Romkey, J., and Salwen, H. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6):23-29, June 1989.
[32] Clark, D. D. and Tennenhouse, D. L. Architectural considerations for a new generation of protocols. Proceedings SIGCOMM Symposium on Communications Architectures and Protocols, pages 200-208, September 1990.
[33] Dalton, C., Watson, G., Banks, D., Clamvokis, C., Edwards, A., and Lumley, J. Afterburner. IEEE Network, 11(2):36-43, July 1993.
[34] Diffie, W. and Hellman, M. E. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644-654, Nov. 1976.
[35] Diot, C. and Dang, M. N. X. Using transputer in the design of high performance architectures dedicated to the implementation of OSI transport protocols. In Transputer Research and Applications 3, pages 17-25, Sunnyvale, California, Apr. 1990.
[36] Diwan, A., Tarditi, D., and Moss, E. Memory-system performance of programs with intensive heap allocation. ACM Transactions on Computer Systems, 13(3):244-273, 1995.
[37] Dove, K. A high capacity TCP/IP in parallel Streams. In Proceedings United Kingdom UNIX Users Group, Jan. 1990.
[38] Druschel, P., Peterson, L., and Davie, B. Experiences with a high-speed network adaptor: A software perspective. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, London, England, Aug. 1994.
[39] Druschel, P. and Peterson, L. L. Fbufs: A high-bandwidth cross-domain transfer facility. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 189-202, Asheville, NC, Dec. 1993.
[40] Eberle, H. A high-speed DES implementation for network applications. Technical Report 90, Digital Equipment Corporation Systems Research Center, Sept. 1992.
[41] Edwards, A. and Muir, S. Experiences implementing a high-performance TCP in user space. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 196-205, Cambridge, MA, Aug. 1995.
[42] Eykholt, J. R., Kleiman, S. R., Barton, S., Faulkner, R., Stein, D., Smith, M., Shivalingiah, A., Voll, J., Weeks, M., and Williams, D. Beyond multiprocessing: Multithreading the SunOS kernel. In USENIX Summer 1992, San Antonio, Texas, June 1992.
[43] Fiber-distributed data interface (FDDI) - Token ring media access control (MAC). American National Standard for Information Systems ANSI X3.139-1987, July 1987. American National Standards Institute.
[44] Galles, M. and Williams, E. Performance optimizations, implementation, and verification of the SGI Challenge multiprocessor. Technical report, Silicon Graphics Inc., Mt. View, CA, May 1994.
[45] Garg, A. Parallel STREAMS: a multi-processor implementation. In Proceedings of the Winter 1990 USENIX Conference, pages 163-176, Washington, D.C., Jan. 1990.
[46] Giarrizzo, D., Kaiserswerth, M., Wicki, T., and Williamson, R. C. High-speed parallel protocol implementation. First IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, pages 165-180, May 1989.
[47] Goldberg, M. W., Neufeld, G. W., and Ito, M. R. The parallel protocol framework. Technical Report 92-16, Department of Computer Science, University of British Columbia, Vancouver, B.C., Aug. 1992.
[48] Goldberg, M. W., Neufeld, G. W., and Ito, M. R. A parallel approach to OSI connection-oriented protocols. Third IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, pages 219-232, May 1993.
[49] Heavens, I. Experiences in parallelisation of Streams-based communications drivers. OpenForum Conference on Distributed Systems, Nov. 1992.
[50] Heinrich, J. MIPS R4000 Microprocessor User's Manual (2nd Ed.). MIPS Technologies, Inc., Mt. View, CA, 1994.
[51] Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach (2nd Edition). Morgan Kaufmann Publishers Inc., San Francisco, CA, 1995.
[52] Herlihy, M. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems, 15(5):6-16, November 1993.
[53] Hickman, K. E. and Elgamal, T. The SSL protocol. Work in progress, Internet Draft (ftp://ds.internic.net/internet-drafts/draft-hickman-netscape-ssl01.txt), June 1995.
[54] Hill, M. D. A case for direct mapped caches. IEEE Computer, 21(12):24-40, December 1988.
[55] Hill, M. D. and Smith, A. J. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1630, December 1989.
[56] Hunt, G. D. Personal communication. June 1996.
[57] Hutchinson, N. C. Protocols versus parallelism. In Proceedings from the x-Kernel Workshop, Tucson, AZ, Nov. 1992. University of Arizona.
[58] Hutchinson, N. C. and Peterson, L. L. The x-Kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64-76, January 1991.
[59] Ito, M., Takeuchi, L., and Neufeld, G. A multiprocessing approach for meeting the processing requirements for OSI. IEEE Journal on Selected Areas in Communications, SAC-11(2):220-227, Feb. 1993.
[60] Jacobson, V. Efficient protocol implementation. In ACM SIGCOMM 1990 Tutorial Notes, Philadelphia, PA, Sept. 1990.
[61] Jacobson, V. A high performance TCP/IP implementation. In NRI Gigabit TCP Workshop, Reston, VA, Mar. 1993.
[62] Jain, N., Schwartz, M., and Bashkow, T. R. Transport protocol processing at Gbps rates. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 188-199, Philadelphia, PA, Sept. 1990. ACM.
[63] Jensen, M. N. and Skov, M. Multi-processor based high-speed communication systems. In 6th European Fibre Optic Communications and Local Area Networks Exposition, pages 430-434, Amsterdam, Netherlands, June 1988.
[64] Kaiserwerth, M. The parallel protocol engine. IEEE Transactions on Networking, 1(6):650-663, Dec. 1993.
[65] Kay, J. and Pasquale, J. The importance of non-data touching processing overheads in TCP/IP. In SIGCOMM Symposium on Communications Architectures and Protocols, pages 259-269, San Francisco, CA, Sept. 1993. ACM.
[66] Kay, J. and Pasquale, J. Measurement, analysis, and improvement of UDP/IP throughput for the DECStation 5000. In USENIX Winter 1993 Technical Conference, pages 249-258, San Diego, CA, 1993.
[67] Kessler, R. E. and Hill, M. D. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4):338-359, Nov. 1992.
[68] Kleinman, S. Symmetric multiprocessing in Solaris 2.0. In IEEE Spring COMPCON, 1992.
[69] Koufopavlou, O. G., Tantawy, A. N., and Zitterbart, M. Analysis of TCP/IP for high performance parallel implementations. In Proceedings of the 17th IEEE Conference on Local Computer Networks, pages 576-585, Minneapolis, Minnesota, Sept. 1992.
[70] Koufopavlou, O. G. and Zitterbart, M. Parallel TCP for high performance communication subsystems. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), pages 1395-1399, 1992.
[71] La Porta, T. F. and Schwartz, M. A high-speed protocol parallel implementation: Design and analysis. Fourth IFIP TC6.1/WG6.4 International Conference on High Performance Networking, pages 135-150, Dec. 1992.
[72] La Porta, T. F. and Schwartz, M. Performance analysis of MSP: A feature-rich high-speed transport protocol. In Proceedings of the Conference on Computer Communications (IEEE Infocom), pages 513-520, San Francisco, CA, Mar. 1993. IEEE.
[73] Leland, W. E., Taqqu, M. S., Willinger, W., and Wilson, D. V. On the self-similar nature of Ethernet traffic. In SIGCOMM Symposium on Communications Architectures and Protocols, pages 183-193, San Francisco, CA, Sept. 1993. ACM.
[74] Lindgren, B., Krupczak, B., Ammar, M., and Schwan, K. Parallel and configurable protocols: Experience with a prototype and an architectural framework. In Proceedings of the International Conference on Network Protocols, pages 234-242, San Francisco, CA, Oct. 1993.
[75] Maly, K., Khanna, S., Mukkamala, R., Overstreet, C. M., Yerraballi, R., Foudriat, E. C., and Madan, B. Parallel TCP/IP for multiprocessor workstations. Fourth IFIP TC6.1/WG6.4 International Conference on High Performance Networking, pages 135-150, Dec. 1992.
[76] Marimuthu, P., Viniotis, I., and Sheu, T.-L. A parallel router architecture for high speed LAN internetworking. In 17th IEEE Conference on Local Computer Networks, pages 335-344, Minneapolis, Minnesota, Sept. 1993. IEEE.
[77] McCutcheon, M. J., Ito, M. R., and Neufeld, G. W. Interfacing a multiprocessor protocol engine to an ATM network. Third IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, pages 155-170, May 1993.
[78] McVoy, L. and Staelin, C. LMBENCH: Portable tools for performance analysis. In USENIX Technical Conference of UNIX and Advanced Computing Systems, San Diego, CA, January 1996.
[79] Mellor-Crummey, J. M. and Scott, M. L. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.
[80] Minnich, R., Burns, D., and Hady, F. The memory-integrated network interface. IEEE Micro, 15(1):11-20, Feb. 1995.
[81] Mogul, J. C. Observing TCP dynamics in real networks. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 305-317, Baltimore, MD, Aug. 1992.
[82] Mogul, J. C., Rashid, R. F., and Accetta, M. J. The packet filter: An efficient mechanism for user-level network code. In The Proceedings of the 11th Symposium on Operating System Principles, Austin, Texas, November 1987.
[83] Mosberger, D., Peterson, L. L., Bridges, P. G., and O'Malley, S. Analysis of techniques to improve protocol processing latency. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, Stanford, CA, Aug. 1996.
[84] Murphy, B., Zeadally, S., and Adams, C. An analysis of process and memory models to support high-speed networking in a UNIX environment. In Proceedings of the Winter USENIX Technical Conference, San Diego, CA, Jan. 1996.
[85] Nahum, E., O'Malley, S., Orman, H., and Schroeppel, R. Towards high-performance cryptographic software. In Proceedings of the Third IEEE Workshop on the Architecture and Implementation of High Performance Communications Subsystems (HPCS), Mystic, Conn, Aug. 1995.
[86] Nahum, E., Yates, D., O'Malley, S., Orman, H., and Schroeppel, R. Parallelized network security protocols. In Proceedings of the Internet Society Symposium on Network and Distributed System Security, San Diego, CA, Feb. 1996.
[87] Nahum, E. M., Yates, D. J., Kurose, J. F., and Towsley, D. Performance issues in parallelized network protocols. In First USENIX Symposium on Operating Systems Design and Implementation, pages 125-137, Monterey, CA, Nov. 1994.
[88] Netravali, A. N., Roome, W. D., and Sabnani, K. Design and implementation of a high-speed transport protocol. IEEE Transactions on Communications, 38(11):2010-2024, Nov. 1990.
[89] Neufeld, G. W., Ito, M. R., Goldberg, M. W., McCutcheon, M. J., and Ritchie, S. Parallel host interface for an ATM network. IEEE Network, pages 24-34, July 1993.
[90] Nuckolls, N. Multithreading your STREAMS device driver in SunOS 5.0. Internet Engineering, Sun Microsystems, Dec. 1991.
[91] Orman, H., O'Malley, S., Schroeppel, R., and Schwartz, D. Paving the road to network security, or the value of small cobblestones. In Proceedings of the 1994 Internet Society Symposium on Network and Distributed System Security, Feb. 1994.
[92] Ousterhout, J. Why aren't operating systems getting faster as fast as hardware? In Proceedings of the Summer USENIX Conference, pages 247-256, June 1990.
[93] Paxson, V. and Floyd, S. Wide-area traffic: The failure of Poisson modeling. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, London, England, Aug. 1994.
[94] Peacock, J. K., Saxena, S., Thomas, D., Yang, F., and Yu, W. Experiences from multithreading System V Release 4. In Proceedings of the Third Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS III), pages 77-91, Newport Beach, CA, Mar. 1992. USENIX.
[95] Peel, R. TCP/IP networking using transputers. In Transputer Research and Applications 3, pages 27-38, Sunnyvale, California, Apr. 1990.
[96] Pettis, K. and Hansen, R. C. Profile guided code positioning. In ACM SIGPLAN '90 Conference on Programming Language Design and Implementation (PLDI), pages 16-27, White Plains, NY, June 1990.
[97] Postel, J. User Datagram Protocol. Network Information Center RFC 768, pages 1-3, Aug. 1980.
[98] Postel, J. Internet Protocol. Network Information Center RFC 791, pages 1-45, Sept. 1981.
[99] Postel, J. Transmission Control Protocol. Network Information Center RFC 793, pages 1-85, Sept. 1981.
[100] Presotto, D. Multiprocessor Streams for Plan 9. In Proceedings United Kingdom UNIX Users Group, Jan. 1993.
[101] Rescorla, E. and Schiffman, A. M. The secure hypertext transfer protocol. Work in progress, Internet Draft (ftp://ds.internic.net/internet-drafts/draft-ietf-wtsshttp-00.txt), July 1995.
[102] Rivest, R. The MD5 message-digest algorithm. Request for Comments (Informational) RFC 1321, Internet Engineering Task Force, Apr. 1992.
[103] Rivest, R., Shamir, A., and Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, pages 120-126, Feb. 1978.
[104] Rosenblum, M., Bugnion, E., Herrod, S. A., Witchell, E., and Gupta, A. The impact of computer architecture on operating system performance. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Canyon, CO, December 1995.
[105] Ross, F. E. FDDI - A tutorial. IEEE Communications Magazine, 24(5):10-17, May 1986.
[106] Rutsche, E. and Kaiserwerth, M. TCP/IP on the parallel protocol engine. Fourth IFIP TC6.1/WG6.4 International Conference on High Performance Networking, pages 119-134, Dec. 1992.
[107] Sabnani, K. and Netravali, A. A high speed transport protocol for datagram/virtual circuit networks. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 146-157, Austin, TX, Sept. 1989. ACM.
[108] Salehi, J. D., Kurose, J. F., and Towsley, D. Further results in affinity-based scheduling of parallel networking. Technical Report UM-CS-1995-046, Department of Computer Science, University of Massachusetts, Amherst, MA, May 1995.
[109] Salehi, J. D., Kurose, J. F., and Towsley, D. The performance impact of scheduling for cache affinity in parallel network processing. In Fourth IEEE International Symposium on High-Performance Distributed Computing (HPDC-4), Pentagon City, VA, Aug. 1995.
[110] Saxena, S., Peacock, J. K., Yang, F., Verma, V., and Krishnan, M. Pitfalls in multithreading SVR4 STREAMS and other weightless processes. In Winter 1993 USENIX Technical Conference, pages 85-96, San Diego, CA, Jan. 1993.
[111] Schmidt, D. C. and Suda, T. Measuring the impact of alternative parallel process architectures on communication subsystem performance. Fourth IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, Aug. 1994.
[112] Schmidt, D. C. and Suda, T. Measuring the performance of parallel message-based process architectures. In Proceedings of the Conference on Computer Communications (IEEE Infocom), Boston, MA, Apr. 1995.
[113] Silicon Graphics Inc. Cord manual page, IRIX 5.3.
[114] Smith, M. D. Tracing with Pixie. Technical report, Center for Integrated Systems, Stanford University, Stanford, CA, April 1991.
[115] Speer, S. E., Kumar, R., and Partridge, C. Improving UNIX kernel performance using profile based optimization. In Proceedings of the Winter 1994 USENIX Conference, pages 181-188, San Francisco, CA, Jan. 1994.
[116] Tantawy, A. and Zitterbart, M. Multiprocessing in high performance IP routers. Third IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, pages 235-254, May 1992.
[117] Thekkath, C., Eager, D., Lazowska, E., and Levy, H. A performance analysis of network I/O in shared memory multiprocessors. Technical report, Dept. of Computer Science and Engineering FR-35, University of Washington, Seattle, WA, July 1992.
[118] Touch, J. Performance analysis of MD5. In ACM SIGCOMM Symposium on Communications Architectures and Protocols, Boston, MA, Aug. 1995.
[119] Ullman, R. TP/IX: The next Internet. In Network Information Center RFC 1475, Menlo Park, CA, June 1993. SRI International.
[120] Unix System Laboratories. Design of the streams subsystem for SVR4 ES/MP. USL Proprietary: Distribution Subject to Signed Agreement, June 1992.
[121] Varghese, G. and Lauck, T. Hashed and hierarchical timing wheels: Data structures for the efficient implementation of a timer facility. In The Proceedings of the 11th Symposium on Operating System Principles, November 1987.
[122] Veenstra, J. E. and Fowler, R. J. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Proceedings 2nd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Durham, NC, January 1994.
[123] Yates, D. J., Nahum, E. M., Kurose, J. F., and Towsley, D. Networking support for large scale multiprocessor servers (extended abstract). In Proceedings of the Third IEEE Workshop on the Architecture and Implementation of High Performance Communications Subsystems (HPCS), Mystic, Conn, Aug. 1995. A full version of this paper is available as Technical Report CMPSCI 95-83, University of Massachusetts.
[124] Yates, D. J., Nahum, E. M., Kurose, J. F., and Towsley, D. Networking support for large scale multiprocessor servers. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, Philadelphia, Pennsylvania, May 1996.
[125] Zitterbart, M. High-speed protocol implementation based on a multiprocessor architecture. First IFIP WG6.1/WG6.4 International Workshop on Protocols for High-Speed Networks, pages 151-163, May 1989.