A Cache Server for News

DIPLOMA THESIS (DIPLOMARBEIT)

A Cache Server for News

carried out at the Institut für Informationssysteme, Abteilung für Verteilte Systeme (Institute of Information Systems, Distributed Systems Department) of the Technische Universität Wien, under the supervision of o. Univ.-Prof. Dipl.-Ing. Dr.techn. Mehdi Jazayeri and Univ.-Ass. Dipl.-Ing. Manfred Hauswirth as the responsible collaborating university assistant, by

Thomas Gschwind
Laudongasse 28/14
A-1080 Wien
Österreich

Matriculation number: 9225658
Vienna, April 1997

To my mother

Zusammenfassung (translated from the German)

Usenet News offers a worldwide open discussion forum, divided into different groups, in which every user can make articles available. The "Network News Transfer Protocol" (NNTP) is used to distribute these groups and articles on the Internet. Since the number of users is constantly increasing and NNTP has been improved only slightly since its invention, it is already reaching its limits in some respects. In this diploma thesis we demonstrate these limits as well as two approaches to resolving the situation. One possible solution is to improve and extend NNTP itself. However, this would mean that many existing programs for today's "News System" would have to be extended or rewritten. The other approach is to make the caching techniques that are already available usable for Usenet News as well. This diploma thesis deals with the latter approach.

Keywords (Schlagworte): Usenet News, NNTP, NNRP, caching

Abstract

Usenet News is a worldwide open forum for discussion. It offers groups to which users may submit articles. These newsgroups and articles are distributed on the Internet using the "Network News Transfer Protocol" (NNTP). Since the number of users has increased steadily and only few improvements have been applied to NNTP, NNTP is beginning to reach its limits. We will explain these limits and present two strategies to improve the situation. One strategy is to extend NNTP. However, this requires modification of the existing news software. The other strategy is to exploit caching mechanisms for Usenet News. This is the topic we concentrate on in this diploma thesis. We will analyze which areas can profit from the use of a News Cache and describe our News Cache implementation.

Keywords: Usenet News, NNRP, NNTP, caching

Acknowledgements

This work is dedicated to my mother. I want to thank her, because I would not have been able to achieve this without her love and support. I would also like to thank Manfred Hauswirth for his supervision and his many comments, and Michael Gschwind for proofreading and for his comments.

Contents

1 Introduction
2 Terminology and Basic Technologies
    2.1 Terminology
    2.2 Caching
    2.3 The Usenet News System
        2.3.1 How Newsgroups and Articles are Related
        2.3.2 Articles
        2.3.3 News Servers
        2.3.4 Distribution Strategy for Articles and Newsgroups
        2.3.5 News Readers
3 Problem Domain
    3.1 Catching Up a Backlog
    3.2 Provision of Local Newsgroups
    3.3 News over Networks with Limited Bandwidth
    3.4 Solution Strategies
        3.4.1 Server-Multiplexing by the Client
        3.4.2 Server-Multiplexing by the News Cache
        3.4.3 News on Networks with Limited Bandwidth
4 Analysis and Requirements
    4.1 Cache Accuracy
    4.2 Where to Locate News Caches
    4.3 News Server Multiplexing
    4.4 News Database
        4.4.1 Newsgroup and Article Database
        4.4.2 Overview Database
    4.5 Caching Strategy
    4.6 Expiration
    4.7 News Cache Interfaces
    4.8 News Cache Cooperation
        4.8.1 Cascading
        4.8.2 Neighboring
    4.9 Concurrent Access to the News Database
5 Design
    5.1 Choice of Programming Language
    5.2 News Server Architecture
    5.3 News Cache Architecture
    5.4 News Server Class Library
        5.4.1 News Server Class (NServer)
        5.4.2 Remote Server Class (RServer)
        5.4.3 Local Server Class (LServer)
        5.4.4 Cache Server Class (CServer)
    5.5 News Database
        5.5.1 Non Volatile Container Class Library
        5.5.2 Active Database
        5.5.3 Overview Database
        5.5.4 Article Database
6 Implementation
    6.1 Multiprocessing Environment
        6.1.1 Processes
        6.1.2 Threads
    6.2 Programming Language
    6.3 News Cache Daemon
    6.4 News Server Class Library
        6.4.1 News Server Class (NServer)
        6.4.2 Remote Server Class (RServer)
        6.4.3 Local Server Class (LServer)
        6.4.4 Cache Server Class (CServer)
    6.5 News Database
        6.5.1 Non Volatile Container Class Library
        6.5.2 Active Database
        6.5.3 Overview Database
        6.5.4 Article Database
    6.6 Miscellaneous Classes
        6.6.1 Multiplexing List
        6.6.2 sstream
7 Evaluation
    7.1 Access Patterns
    7.2 Tested News Readers
    7.3 Performance Data
8 Future Work
    8.1 Improvements
        8.1.1 News Database
        8.1.2 Caching Granularity
        8.1.3 Expiration of News Articles
    8.2 News Cache Extensions
        8.2.1 Local Newsgroups
        8.2.2 Prefetching
        8.2.3 Offline News Reading
    8.3 Extensions to NNTP
9 Related Work
    9.1 Comparison
        9.1.1 News Database
        9.1.2 Functionality
10 Summary and Discussion
A Notations
    A.1 Object Oriented Model
        A.1.1 Example
    A.2 Finite State Machine
        A.2.1 Example
B NNTP and NNRP
    B.1 Difference between NNTP and NNRP
    B.2 NNTP Commands
        B.2.1 List
        B.2.2 Group
        B.2.3 Article
        B.2.4 Head, Body, Stat
        B.2.5 Post
        B.2.6 Newgroups
        B.2.7 Newnews
        B.2.8 Ihave
    B.3 NNRP Commands
        B.3.1 List
        B.3.2 Listgroup
        B.3.3 Xover
C Installation of the News Cache
    C.1 Setting Up the Target Directory
    C.2 Compile Time Options
        C.2.1 Configure.h
        C.2.2 Makefile
    C.3 Installation
D Operation of the News Cache
    D.1 ConfigVersion
    D.2 NewsServers
    D.3 SpoolDirectory
    D.4 TimeOuts
    D.5 CachePort
    D.6 Retries
    D.7 ServerType

List of Figures

2.1 Relationship between newsgroups and articles
2.2 An example news article (header and body)
2.3 A sample News network
3.1 Growth of the Usenet in the past years
3.2 Configuration with a server side News Cache
3.3 Provision of local newsgroups
3.4 Merging local and global newsgroups by an intermediate server
3.5 Using a News Cache
3.6 Workaround to multiplex between different news servers
4.1 Where to locate a News Cache
4.2 File system space overhead
4.3 Cascading News Caches
5.1 Notation used for the object-oriented design
5.2 General structure of a news server
5.3 State transition diagram for a news server
5.4 Architecture of the News Cache
5.5 Hierarchy of the News Server classes
5.6 News reader using RServer class
5.7 News server using LServer class
5.8 News cache using CServer class
5.9 NVList's virtual memory usage
5.10 Class hierarchy of the Non Volatile Container classes
5.11 Class hierarchy for the active database
5.12 Class hierarchy for the overview database
6.1 The News Cache Daemon
6.2 Hierarchy of the News Server classes
6.3 Relation between news server class and the news database
6.4 Format of an NVContainer's header
6.5 Organization of the NVList class
6.6 Organization of the NVHash class
6.7 Interaction of the ActiveDB class with other classes
6.8 Interaction of the OverviewDB class with other classes
8.1 Merging local and global newsgroups by an intermediate cache server
A.1 Notation for object-oriented models
A.2 Example for object-oriented notation
A.3 Notation used for finite state machines
A.4 Example to the notation of finite state machines
B.1 Telnet session to a news server's NNTP port
B.2 A sample overview record
D.1 News server configuration

List of Tables

2.1 Newsgroups and their hierarchy
6.1 Abstract and virtual methods in C++
6.2 The header of an NVContainer
6.3 NVContainer's methods
6.4 NVList's methods
6.5 NVHash's methods
6.6 ActiveDB's methods
6.7 OverviewDB's methods
6.8 A sample MPList
7.1 Access statistics
7.2 Performance measurements
10.1 Usenet News problems solved by the News Cache
B.1 Meaning of the first digit of an NNTP/NNRP reply
B.2 Meaning of the second digit of an NNTP/NNRP reply

Chapter 1
Introduction

Usenet News is a popular and easy way to hold world-wide discussions on the Internet. Its basic building blocks are newsgroups, each dedicated to a particular topic. Users can offer information through articles (also called postings) and read other users' articles.

In order to access News, a frontend program called a news reader (or news client) is needed. A news reader connects to a news server, requests the list of newsgroups, and presents this list to the user. When the user selects a newsgroup, the news reader requests the list of articles available in that group and presents a summary of the articles to the user. From this list the user can choose the articles he wants to read. The news reader remembers which articles the user has read and does not present them again in the next session in this group (although they are still available). Other functions, such as grouping related articles (threading), replying to articles, or sending email to the author of an article, are provided too.

The Usenet is composed of several news servers that store the available newsgroups and articles. Whenever an article is posted to a newsgroup on a news server, the article is propagated, that is, copied, to all other news servers storing the same newsgroup. On the Internet, articles are usually distributed using the "Network News Transfer Protocol" (NNTP) and in some cases still using "Unix to Unix CoPy" (UUCP). The distribution of articles is done on a best-effort basis and is not guaranteed. This means that an article need not be transferred to the other news servers immediately and that a news server does not necessarily receive all existing news articles.

Since the Usenet has been growing rapidly over the last years and continues to grow, the network bandwidth required for newsgroup and article distribution is increasing. Hence, the News system will reach its limits within the next years, and new mechanisms have to be found to overcome this situation. Additionally, these mechanisms might even give better access to the news system for people who are connected over lower-bandwidth networks, because they reduce the required bandwidth while maintaining the same quality of service.

In chapter 2 we present the terminology used throughout this document. A short overview of basic technologies like caching and the operational principles of the Usenet News system is also given.

In chapter 3 we explain the problems inherent in the currently used News system. These problems include a high load on the news server in some settings and the increased network bandwidth requirements for the transmission of news. At the end of that chapter we give an overview of possible solution strategies to reduce or even eliminate the existing problems.

In chapter 4 we analyze the News Cache in general: which areas can benefit from the News Cache, and what requirements the News Cache has to meet. Special care has been taken that the News Cache can be integrated into existing environments without the need for modifications.

The architecture and design of the News Cache are shown in chapter 5. Its first section presents the general design of a news server, which is the basis for the design of the News Cache. A brief overview of the implementation and an explanation of the interaction between the different modules are given in chapter 6. The exact implementation details are documented within the News Cache's source code.

In chapter 8 we show work that will be included in future releases of the News Cache. This includes the improvement of the existing classes and the addition of new features. In chapter 9 we present related work that has been done in this area, and in chapter 10 we give a summary of this thesis.

Chapter 2
Terminology and Basic Technologies

This chapter introduces the basic terminology used throughout this thesis. This is necessary to distinguish clearly between similar but nevertheless different terms. We also explain the principle of caching and give some insights into the News system.

2.1 Terminology

News Server: A server that stores newsgroups and articles and is responsible for the distribution of newly arrived articles. The news server itself has no caching or proxying functionality.

News Reader (Client): A program used to access news articles. The news client requests newsgroups and articles from the news server and presents these articles to its user through a user-friendly interface.

News Feed: A news server that provides (feeds) newsgroups and articles to other news servers.

Leaf Node News Server: A news server that is connected to only one other news server (its only news feed). Thus a leaf node news server does not have to exchange news articles with many other news servers, but only with its news feed.

Cache Server: A cache server speeds up requests for repeatedly requested objects. Whenever a client issues a request, the cache server checks whether the requested data have already been cached. If not, the cache server retrieves the requested data from the original server and stores them locally. Then the requested data are sent to the client. Hence, successive requests for the same data reduce the required network bandwidth between the cache server and the original server.

Proxy Server: A proxy server passes requests from its clients to another server, and all answers received from that server are passed back to the proxy's clients. This configuration makes sense if a client cannot reach a given server directly (e.g., in the case of a firewall). In this situation clients can make use of the service provided by the server indirectly via the proxy server. Frequently a cache and a proxy server are combined.

Global Newsgroups: Newsgroups are called global if they are accessible world-wide. This implies that the newsgroups are distributed to news servers that belong to different organizations.

Local Newsgroups: A newsgroup is called local if it is not distributed to other news servers. This is useful if the newsgroup should provide a local discussion forum that may even contain sensitive data. Hence the newsgroup's articles must not be distributed to other news servers.

2.2 Caching

Caching is a well-established technique to reduce the retrieval time and network bandwidth for repeatedly requested data. After the first request, the data are stored (cached) on a medium that allows them to be provided to the requester quickly (at least faster than the time required to satisfy the original request). Repeated requests for the same data are satisfied from the caching medium. The caching medium depends on the application domain [Tan96].

In the network area, caching not only reduces the transmission time, it also reduces the network bandwidth required for repeatedly requested data. Assume that some users are located in the same area and that they require the same object stored in some distant location. These users may either request it directly from the distant location or use a cache server located "closer" to them. The first time the object is requested from the cache server, the cache server retrieves it from the distant location and stores it. Successive requests are handled from the cache server's copy. To ensure that the cache has an up-to-date copy of the object, control messages have to be exchanged between the cache server and the original object's server. These messages are usually small compared to the size of the original data.

If several cache servers are available, they may work together to achieve even higher benefits. If a cache server does not store the requested data, it may ask other related caches for them. Using related caches in a hierarchical structure is known as cascading. Another technique, called neighboring, combines several caches into one big virtual cache: if some data are requested, each cache server can ask all other caches for the requested data.

Cascading: The caches build up a hierarchy (e.g., a tree), and if one cache server does not have the requested data, it asks its parent server. However, the height of the hierarchy has to be kept moderate. A deep hierarchy usually brings very few benefits, since it takes too much time to go through all levels, possibly up to the root server.

Neighboring: This strategy combines several caches into one big virtual cache. Each cache may access the data stored in the other caches. This further improves the access time if the requested data are available in one of the other caches and the link to these caches is faster than the link to the server holding the original data.

Both techniques described here are already used in the "World Wide Web" (WWW) for the "Hypertext Transfer Protocol" (HTTP) and the "File Transfer Protocol" (FTP). Cascading was introduced by the CERN WWW server ([Nie96] and [W3C95]), which was the first cache server (and even the first WWW server) available for the WWW. Neighboring was introduced by the Harvest Object Cache [BDH+94], which has evolved into the Squid software ([Pea97] and [Wes97]).
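The lookup order implied by cascading and neighboring can be sketched as follows. This is a minimal illustration under stated assumptions, not the News Cache's actual design: the class name `CacheServer`, its attributes, and the choice to ask neighbors before the parent are all hypothetical, the original server is stood in for by a plain dictionary, and the control messages that keep copies up to date are not modeled.

```python
class CacheServer:
    """Read-through cache that may consult neighbors and a parent cache.

    Hypothetical sketch: names and lookup order are assumptions.
    """

    def __init__(self, name, origin=None, parent=None):
        self.name = name
        self.origin = origin    # dict standing in for the original server
        self.parent = parent    # next cache up the hierarchy (cascading)
        self.neighbors = []     # sibling caches (neighboring)
        self.store = {}         # locally cached objects
        self.log = []           # (key, source) for every request served

    def lookup_local(self, key):
        """Return the locally cached object or None; contacts nobody else."""
        return self.store.get(key)

    def get(self, key):
        # 1. Local hit: no network traffic at all.
        if key in self.store:
            self.log.append((key, "local"))
            return self.store[key]
        # 2. Neighboring: ask the sibling caches for their local copy.
        for neighbor in self.neighbors:
            data = neighbor.lookup_local(key)
            if data is not None:
                self.store[key] = data
                self.log.append((key, "neighbor:" + neighbor.name))
                return data
        # 3. Cascading: ask the parent cache; the root asks the origin.
        if self.parent is not None:
            data = self.parent.get(key)
            source = "parent:" + self.parent.name
        else:
            data = self.origin[key]
            source = "origin"
        self.store[key] = data
        self.log.append((key, source))
        return data


# Two leaf caches below one root cache; the leaves are also neighbors.
origin = {"<1@example.invalid>": "article text"}
root = CacheServer("root", origin=origin)
a = CacheServer("a", parent=root)
b = CacheServer("b", parent=root)
a.neighbors.append(b)
b.neighbors.append(a)

a.get("<1@example.invalid>")   # miss everywhere: fetched via the root
b.get("<1@example.invalid>")   # served from neighbor a
a.get("<1@example.invalid>")   # served locally
```

In this run the original server is contacted only once, by the root cache; every later request is satisfied by a cache that is assumed to be closer to the requester.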

2.3 The Usenet News System

The "Network News Transfer Protocol" (NNTP) has been standardized in RFC 977 [KL86]. Extensions made to this protocol are explained in [Bar97]. A short overview of NNTP (and NNRP, an extension for news readers) is given in appendix B. The format of news messages is specified in [HA87].

The Usenet News System currently consists of over 30000 newsgroups organized in a hierarchy, which is represented by a dot notation. For example, newsgroups starting with news. deal with the News System itself and newsgroups starting with comp.lang. deal with computer languages. Table 2.1 shows a small part of this hierarchy. This arrangement makes it easy to identify and subscribe to the newsgroups one is interested in.
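The dot notation lends itself to simple prefix matching. The helpers below are a small sketch; the function names `hierarchies` and `in_hierarchy` are hypothetical and only illustrate how a newsgroup name maps to the hierarchies that contain it.

```python
def hierarchies(group):
    """Return all hierarchy prefixes of a newsgroup name, most general first."""
    parts = group.split(".")
    return [".".join(parts[:i]) + "." for i in range(1, len(parts))]


def in_hierarchy(group, prefix):
    """True if the newsgroup lies within the hierarchy named by prefix.

    The prefix is written with its trailing dot (e.g. "comp.lang."), so that
    "comp." does not accidentally match a group such as "computers.misc".
    """
    return group.startswith(prefix)


groups = ["comp.lang.c", "news.software.nntp", "soc.culture.austrian"]
news_groups = [g for g in groups if in_hierarchy(g, "news.")]
```

Here `hierarchies("soc.culture.austrian")` yields `["soc.", "soc.culture."]`, which corresponds to the nesting of the soc. entries in Table 2.1.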

2.3.1 How Newsgroups and Articles are Related

Each newsgroup holds news articles. A news article may be stored in one or more newsgroups. Figure 2.1(a) shows an example of the relationship between newsgroups and articles, while Figure 2.1(b) gives the corresponding enhanced entity-relationship (EER) diagram [EN94].

(a) Sample Newsgroups and Articles

    Newsgroup: at.news
        T. Gschwind    Announce: News Cache
        T. Gschwind    Bugfix for News Cache
        Operator       Downtime
        ...            ...

    Newsgroup: news.answers
        N. User        What is News
        O. Guru        Re: What is News
        N. User        Are there any rules?
        G. Pelz        Re: Are there any rules?
        ...            ...

    Newsgroup: news.software.nntp
        T. Gschwind    Announce: News Cache
        O. Guru        Bug in News Cache
        S. Barber      Re: Bug in News Cache
        T. Gschwind    Bugfix for News Cache
        ...            ...

(b) Relationship as EER-Diagram

    Entity Newsgroup (Name, ...) related to entity Article (Id, Author, Subject, Contents, ...)

Figure 2.1: Relationship between newsgroups and articles

    Name                    Purpose
    alt.                    The alternative hierarchy. Everybody is allowed to create a newsgroup within this hierarchy.
    biz.                    Covers business related stuff.
    comp.                   This hierarchy deals with all aspects of computers.
    news.                   This hierarchy deals with the news system itself.
    rec.                    Recreational hierarchy.
    sci.                    Hierarchy for scientific newsgroups.
    soc.                    Deals with society in general.
    soc.culture.            Deals with the society's culture.
    soc.culture.austrian    This newsgroup deals with the Austrian society's culture.
    talk.                   General talk.

Table 2.1: Newsgroups and their hierarchy

The following types of newsgroups exist.

Reading and Posting allowed: Everybody is allowed to read articles from or post articles to these newsgroups.

Moderated: These groups can be read by everybody. However, articles posted to such a group are sent to the group's moderator, who decides whether the article's contents suit the newsgroup's topic and whether it should be posted.

Read-Only: These groups can only be read by ordinary users. Posting articles to these groups requires some kind of authorization.
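The three newsgroup types amount to a small decision rule for incoming postings. The sketch below only restates that rule; `GroupType`, `route_posting`, and the returned strings are hypothetical names chosen for illustration and do not correspond to any real news server interface.

```python
from enum import Enum


class GroupType(Enum):
    POSTING_ALLOWED = "reading and posting allowed"
    MODERATED = "moderated"
    READ_ONLY = "read-only"


def route_posting(group_type, poster_is_authorized=False):
    """Decide what happens to an article submitted to a group of this type."""
    if group_type is GroupType.POSTING_ALLOWED:
        return "post"                 # everybody may post
    if group_type is GroupType.MODERATED:
        return "send to moderator"    # the moderator decides whether it is posted
    if group_type is GroupType.READ_ONLY:
        # Ordinary users may only read; posting needs some authorization.
        return "post" if poster_is_authorized else "reject"
    raise ValueError("unknown group type: %r" % (group_type,))
```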

2.3.2 Articles

An article is a piece of information submitted by a user to one or more newsgroups (articles submitted to several newsgroups are called crosspostings). Articles are also called postings, and the submission of an article is frequently called "posting an article". Each article carries several identifiers:

- An identifier that is globally unique within the News system (in the form serverid@server).

- An article number that is unique within the article's newsgroup. This identifier depends on the arrival order of the articles and is different for the same article on different news servers. If the article has been posted to several newsgroups (crossposted), the article has several article numbers.

Articles start with header lines, followed by a blank line and by the message body. Each header line consists of a keyword, a colon, a blank, and some additional information. The exact description of the format of an article can be found in RFC 1036 [HA87]. The header contains at least the following header lines.

From: The "From" line contains the electronic mail address of the person who sent the article.

Date: The "Date" line gives the date and time when the article was originally posted to the network.

Newsgroups: The "Newsgroups" line specifies the newsgroup or newsgroups to which the article belongs.

Subject: The "Subject" line gives a short summary of the contents of the article, so that a reader can decide on the basis of the subject whether to read the article.

Message-ID: The "Message-ID" line gives the article's unique identifier. To ensure the uniqueness of the Message-ID, it may not be reused during the lifetime of the article.

Path: This line shows the network path the article took to reach the current system. When a system forwards the article, it should add its own name to the list of systems in the "Path" line.

The sample article shown in Figure 2.2 has the subject News Cache Released and has been posted by the user gschwind from host w5.infosys.tuwien.ac.at.
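The header format described above (keyword, colon, blank, value, with a blank line before the body) can be parsed in a few lines. The following is a sketch rather than a full RFC 1036 parser: the function names are hypothetical, continuation lines are handled on the assumption that they start with a blank or tab, and the address and Message-ID in `SAMPLE` are placeholders invented for illustration, not the values of the thesis's sample article.

```python
REQUIRED_HEADERS = ("From", "Date", "Newsgroups", "Subject", "Message-ID", "Path")


def parse_article(text):
    """Split an article into a dict of header lines and the body string."""
    head, _, body = text.partition("\n\n")   # the first blank line ends the header
    headers = {}
    last_keyword = None
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and last_keyword is not None:
            # Assumed continuation line: append it to the previous header.
            headers[last_keyword] += " " + line.strip()
            continue
        keyword, colon, value = line.partition(":")
        if not colon:
            raise ValueError("not a header line: %r" % line)
        last_keyword = keyword.strip()
        headers[last_keyword] = value.strip()
    return headers, body


def missing_headers(headers):
    """Return the required header keywords that the article lacks."""
    return [name for name in REQUIRED_HEADERS if name not in headers]


# Placeholder article modeled on the layout of Figure 2.2.
SAMPLE = (
    "Path: snoopy.tom.priv.at!snoopy.tom.priv.at!tom\n"
    "From: tom@example.invalid (Thomas Gschwind)\n"
    "Newsgroups: comp.os.linux.announce\n"
    "Subject: News Cache released\n"
    "Date: 31 Dec 1996 11:33:58 GMT\n"
    "Message-ID: <placeholder@example.invalid>\n"
    "\n"
    "Hello!\n"
    "\n"
    "Thomas\n"
)

headers, body = parse_article(SAMPLE)
newsgroups = headers["Newsgroups"].split(",")
```

Only the first colon of a line separates keyword and value, so the colons inside the "Date" value are left untouched.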

2.3.3 News Servers

Only newsgroups and articles stored on the news server can be retrieved by a client. This implies that each news server has to maintain a database storing its available newsgroups and articles. News servers are responsible for indexing and expiring their articles. The quality of this service depends on the news server's software and its configuration. Newsgroups may be available on one or more news servers. If newsgroups are to be available on other news servers too, the newsgroups and their articles must be propagated to those news servers. On the Internet, newsgroups and articles are generally distributed using the "Network News Transfer Protocol" (NNTP). The administrator of a news server decides which newsgroups can be retrieved by other news servers and which newsgroups will be requested from other news servers. Most newsgroups and articles are distributed globally by default.

    Path: snoopy.tom.priv.at!snoopy.tom.priv.at!tom
    From: [email protected] (Thomas Gschwind)
    Newsgroups: comp.os.linux.announce
    Subject: News Cache released
    Date: 31 Dec 1996 11:33:58 GMT
    Organization: Technische Universitaet Wien
    Lines: 7
    Message-ID:
    NNTP-Posting-Host: w5.infosys.tuwien.ac.at

    Hello!

    The News Cache is a fully featured server for caching of
    news articles and groups. Get it from
    http://www.infosys.tuwien.ac.at/~gschwind/newscache/

    Thomas

Figure 2.2: An example news article (header and body)

Only newsgroups of questionable value or of purely local interest are not distributed to all news servers. Clients may read news using either NNTP or the "Network News Reader Protocol" (NNRP). The latter is selected using NNTP's mode reader command. NNRP provides only those NNTP commands that a client needs for the retrieval and submission of news articles. In addition, it provides commands to retrieve a summary of the available articles or to filter specific information about them. See appendix B, [KL86], and [Bar97] for a detailed discussion.

2.3.4 Distribution Strategy for Articles and Newsgroups

Figure 2.3 shows a small sample news network and possible NNTP and NNRP connections.

[Figure 2.3 diagram: News Servers 1 to 4 with their clients (Clients 1.1 to 1.4, 2.1 to 2.3, 3.1 to 3.2 and 4.1 to 4.2). Legend: possible NNTP connection, possible NNRP connection.]

Figure 2.3: A sample News network

If an article is posted to a news server, the article has to be propagated to all other news servers that hold the corresponding newsgroup. Assume that an article is posted to News Server 4 by Client 4.1 into a newsgroup stored on all news servers. Initially the article arrives at News Server 4, which propagates the article to News Servers 1 and 2. Then News Server 1 propagates it to News Servers 3 and 2. The propagation to News Server 3 succeeds. However, News Server 2 refuses to accept the article since it has already stored it. News Server 2 can detect this situation because each article contains a globally unique identifier (as shown in section 2.3.2).
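The propagation walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feed links in `links`, the server names and the Message-ID are hypothetical and were chosen to reproduce the example in the text; real news servers negotiate each transfer over NNTP rather than calling a function.

```python
from collections import deque

def flood(links, origin, message_id):
    """Propagate one article through a news network by flooding.

    links maps each server to the servers it feeds (hypothetical topology).
    Returns the offers that were accepted and those refused as duplicates.
    """
    history = {origin: {message_id}}      # Message-IDs each server has stored
    accepted, refused = [], []
    queue = deque([origin])
    while queue:
        sender = queue.popleft()
        for receiver in links.get(sender, []):
            seen = history.setdefault(receiver, set())
            if message_id in seen:
                refused.append((sender, receiver))   # article already stored
            else:
                seen.add(message_id)
                accepted.append((sender, receiver))
                queue.append(receiver)               # receiver passes it on

    return accepted, refused

# Assumed feed links, chosen to match the walk-through in the text.
links = {"server4": ["server1", "server2"], "server1": ["server3", "server2"]}
accepted, refused = flood(links, "server4", "<example-1@news.example>")
print("accepted:", accepted)
print("refused:", refused)
```

The duplicate check relies only on the globally unique identifier, so a server never needs to know which path an article took to reach it.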


A newsgroup is created by sending a specially formatted article to a special newsgroup (usually called control). The creation of newsgroups is distributed in exactly the same way as articles are.

2.3.5 News Readers

To access articles stored on a news server the user needs a news reader program. The news reader retrieves the articles from the news server using NNRP. The news reader acts as a frontend to the news server's database and stays connected to the same news server during the user's session. The news reader is responsible for displaying a list of available newsgroups, from which the user selects the newsgroup he wants to read. When a newsgroup has been selected the news reader shows a list of the articles available in that newsgroup, and the user can select the articles he intends to read. To keep track of which articles the user has read, most news readers always need to connect to the same news server. This is necessary because many news readers rely on information that depends on the order in which articles arrived at the news server, and this order differs between news servers.

Chapter 3 Problem Domain

Usenet has been growing over the last years and continues to grow. As shown in Figure 3.1 the number of articles and newsgroups nearly doubles each year. The growth of the Usenet as depicted in the figure has implications for the hardware of the machine running the news server software, for the operating system and for the attached network.

[Figure 3.1 plot: number of newsgroups (in units of 10^4) and GigaBytes per day, by year from 94 to Apr. '97]

Figure 3.1: Growth of the Usenet in the past years

Since articles are distributed by copying, at least 400 KBit/s of network bandwidth have to be dedicated permanently to the transmission of news articles. This value does not include the overhead generated by NNTP's communication messages. Hence the actual network bandwidth required for a full news feed is even higher. (The values given here have been taken from the news server at the "Technische Universität Wien".)

The machine running the news server needs powerful hardware. Fast CPUs are required since large news database indexes have to be generated. Big, fast and high-quality hard drives have to be used because many articles have to be stored and many accesses to articles have to be fulfilled, which puts considerable strain on the hard disk. These demands increase with the steady growth of the Usenet. Since each article is stored in its own file, a better filesystem is required as more articles have to be stored. Many filesystems have the problem that only a fixed number of files may be stored. Hence the maximum number of files may be reached although free disk space is available. The operating system has to support a fast file-locking mechanism (to ensure mutual exclusion); otherwise it is possible that a news server cannot catch up a backlog of the news feed. We will explain this in section 3.1 in more detail.

An extension to NNTP might reduce these requirements on the network bandwidth and the computer hardware. In some situations a news server can be replaced by another server which does not store the whole news spool. In the following sections we will illustrate a few problem scenarios where an alternative solution or an extension to NNTP might provide the same service, but using fewer resources.
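To relate the sustained rate quoted for a full feed to a daily volume, the conversion can be written out as a small calculation. This is a back-of-the-envelope sketch: it assumes that one KBit means 1000 bits and that the link is used around the clock, and it ignores the NNTP overhead mentioned in the text.

```python
def bytes_per_day(kbit_per_s, bits_per_kbit=1000):
    """Convert a sustained rate in KBit/s into bytes transferred per day."""
    seconds_per_day = 24 * 60 * 60
    return kbit_per_s * bits_per_kbit // 8 * seconds_per_day

full_feed = bytes_per_day(400)
print(full_feed / 1e9)   # daily volume in units of 10^9 bytes
```

Under these assumptions 400 KBit/s corresponds to roughly 4.3 GigaBytes per day.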

3.1 Catching Up a Backlog

A backlog of the current news feed can occur if the news server has been shut down (e.g. for maintenance work) or has lost its network connection for some time. This backlog is difficult to catch up during regular operation because of the heavy I/O load that the update process causes in parallel to the news reading clients and the normal news operation.

[Figure 3.2 diagram: a global.* news server, a News Cache Server and two clients. Legend: low bandwidth link, NNRP connection.]

Figure 3.2: Configuration with a server side News Cache

The I/O load on the news server can be reduced by reducing the number of accesses to the news database, which can be accomplished by providing a server side news cache. Clients connect to the news cache instead of the news server, as shown in Figure 3.2. The cache server can identify duplicate requests and issues these requests to the news server only once, thus reducing the number of requests made to the news server's news database.

3.2 Provision of Local Newsgroups

With the software currently available it is expensive to provide local newsgroups in addition to the newsgroups provided by the feeding news server. The obvious approach is to set up another news server providing the local newsgroups and to let the news reader decide which news server to contact for which newsgroups. This configuration is shown in Figure 3.3(a). This method does not require access to the news server holding the global newsgroups. Unfortunately most news readers do not support news retrieval from different news servers.

[Figure 3.3 diagram. Panel (a), news reader multiplexes between local and global news servers: a global.* news server; Computer Science Dep. with local.cs.* and Maths Dep. with local.math.*, each with a client. Panel (b), local news server holds local and global newsgroups: a global.* news server; Computer Science Dep. with global.* and local.cs.*, Maths Dep. with global.* and local.math.*, each with a client. Legend: news server, NNTP connection, NNRP connection.]

Figure 3.3: Provision of local newsgroups

Thus, the only way to accomplish this is to set up a news server holding both the local and the global newsgroups. Figure 3.3(b) shows this configuration. This method requires neither access to the news server holding the global newsgroups nor a news reader that is able to distinguish between different news servers. However, this method has some disadvantages.

• The administrator of the local news server has to manage a news spool for the global newsgroups in addition to the local newsgroups. This extra management work is a time-consuming task.
• More network bandwidth is required, because the articles of all global newsgroups have to be transferred to the local site, even if these articles will never be accessed by the users of the local news server. This waste is all the more unnecessary if the connection to the feeding news server is fast enough to retrieve articles on demand.
• The local news server requires more hardware: more disk space, because all articles have to be stored; a more powerful processor, because the articles in the local and the global newsgroups have to be indexed; and even a better network adapter, because increased network traffic has to be handled.

This shows that it is expensive to operate a news server with local and global groups just for the provision of local newsgroups. Hence a better solution has to be found where local and global groups can be combined without setting up a news server with a full feed and without modification to any news server or reader. Figure 3.4 outlines such a configuration.

[Figure 3.4 diagram: a global.* news server; Computer Science Dep. with local.cs.* and Maths Dep. with local.math.*, each with a client.]

Figure 3.4: Merging local and global newsgroups by an intermediate server


3.3 News over Networks with Limited Bandwidth

Another problem arising with the Usenet's steady growth is the provision of a full news feed to a news server connected by a low bandwidth link (e.g., a modem line or ISDN). The bandwidth required for a full news feed is currently about 400 KBit/s†, 24 hours a day. This requirement makes it impossible to provide news to a large number of people over a low bandwidth network. In this situation a cache server has to be used to reduce the network traffic caused by repeatedly requested articles. This setup is shown in Figure 3.5.

[Figure 3.5 diagram: a global.* news server, a News Cache Server and two clients. Legend: low bandwidth link, NNRP connection.]

Figure 3.5: Using a News Cache

3.4 Solution Strategies

The previous sections have shown the drawbacks of the current ways of providing local newsgroups and of providing Usenet News to areas with limited network bandwidth. The following sections give an overview of possible solution strategies for these drawbacks. Among these solutions we will present our News Cache and briefly explain the benefits of this approach. The News Cache itself will be discussed in detail in the following chapters.

3.4.1 Server-Multiplexing by the Client

The need to read newsgroups from several news servers could be met by patching all existing news readers. Unfortunately this is obviously infeasible. It would, however, be the cleanest approach, because the client would decide which server to contact for which newsgroup without requiring additional software. A possible workaround is to keep a configuration file for each news server and to reconfigure the news reader for each session. Figure 3.6 shows such a workaround for news readers storing their configuration in ~/.newsrc. It works on most Unix and Unix-like systems. Similar workarounds exist for other news readers and other environments.

† News server at the "Technische Universität Wien".

    #!/bin/sh
    # Usage: <script> [-ns <news server>] [news reader arguments]
    MYNS=tin                      # news reader to start
    if [ "arg$1" = "arg-ns" ]; then
        # Export the server name so that the news reader can see it.
        NNTPSERVER=$2
        export NNTPSERVER
        # Keep one newsrc file per news server and link the active one.
        touch ~/.newsrc-"$NNTPSERVER"
        rm -f ~/.newsrc
        ln -s ~/.newsrc-"$NNTPSERVER" ~/.newsrc
        shift
        shift
    fi
    exec "$MYNS" "$@"

Figure 3.6: Workaround to multiplex between different news servers

3.4.2 Server-Multiplexing by the News Cache

The workaround presented in section 3.4.1 does not work for all news servers and is inconvenient for the user in several ways. First, if the user wants to read a newsgroup from another server, he has to terminate the news reader and restart it. Second, the workaround restricts the ability of some news readers to work on several newsgroups simultaneously: this is no longer possible for newsgroups stored on different news servers. The News Cache can include the functionality to contact different news servers for different newsgroups. Whenever a news reader selects a newsgroup from the News Cache, the News Cache decides which news server to contact for the given newsgroup. All this is transparent to the news reader. Another advantage of this approach is that the user need not worry about which newsgroups are provided by which news server.
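The server selection described above can be sketched as a lookup from newsgroup name to upstream server. This is a minimal sketch, not the News Cache's actual configuration format: the `ROUTES` table, the host names and the first-match rule are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: the first matching pattern wins.
ROUTES = [
    ("local.cs.*", "news.cs.example"),
    ("local.math.*", "news.math.example"),
    ("*", "news.global.example"),
]

def server_for(newsgroup, routes=ROUTES):
    """Return the upstream news server responsible for a newsgroup."""
    for pattern, server in routes:
        if fnmatchcase(newsgroup, pattern):
            return server
    raise LookupError(f"no news server configured for {newsgroup}")

print(server_for("local.cs.announce"))
print(server_for("comp.os.linux.announce"))
```

Because the decision is made inside the cache, the news reader keeps talking to a single server address and needs no knowledge of the table.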


3.4.3 News on Networks with Limited Bandwidth


Without modifications to NNTP this problem can only be solved for leaf node news servers. Since leaf node news servers need not hold the whole news feed, they can be replaced by a News Cache. The News Cache retrieves only those articles that are requested by a news client. Thus the News Cache eliminates the transmission of articles that will never be accessed by any news client. To solve this problem in general it is necessary to add some new commands to NNTP. One command could request the transfer of articles in compressed form. This could reduce the required network bandwidth by approximately 50%. Another command could request the transfer of all newly arrived articles smaller than a given size. This is useful because negotiating the transmission of small articles does not pay off, while the transmission of larger articles (e.g. those found in the binary newsgroups) would still be negotiated. An explanation of the currently used article negotiation can be found in [KL86].

Chapter 4 Analysis and Requirements

In the previous chapter we have shown problems of the existing News System and have given a short overview of possible solution strategies, including our own idea. We think that most of these problems can be solved using our News Cache. In this chapter we will explain the News Cache's principles. We will analyze which areas can benefit from the use of a news cache and which caching strategies can be used. Later in this chapter we will discuss its requirements, e.g. the caching strategy and the interfaces to the clients and to the news server.

4.1 Cache Accuracy

The News Cache receives requests for newsgroups and articles from its clients. If the requested data are already stored in the cache, the News Cache can respond to the request directly. Otherwise the News Cache requests these data from its news server, stores them, and sends a copy to the requesting client. Hence successive references to the same data can be answered by the cache without consulting the news server. Control messages are required to ensure that the cached data are accurate. However, it is not necessary to issue control messages after each client request, because it is acceptable to provide new articles with a small delay and because News is not a highly dynamic medium. Hence, the control messages may be omitted for newsgroups whose contents have been verified within a certain time period (e.g., the last few minutes).
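The read path and the relaxed verification rule can be illustrated as follows. This is a sketch under assumptions: `GroupCache`, `fetch_group`, `max_age` and the default of 300 seconds are hypothetical names and values, and a stale entry is simply refetched here, whereas a real cache would more likely ask the news server only whether the group has changed.

```python
import time

class GroupCache:
    """Sketch of the read path: serve from cache, revalidate only when stale."""

    def __init__(self, fetch_group, max_age=300.0, clock=time.monotonic):
        self.fetch_group = fetch_group   # hypothetical callable asking the news server
        self.max_age = max_age           # seconds a verified entry is trusted
        self.clock = clock
        self.entries = {}                # group name -> (verified_at, data)

    def get(self, group):
        now = self.clock()
        entry = self.entries.get(group)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]              # verified recently: no server contact
        data = self.fetch_group(group)   # miss or stale: ask the news server
        self.entries[group] = (now, data)
        return data
```

Passing the clock as a parameter is a design choice that keeps the sketch testable without waiting for real time to pass.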

4.2 Where to Locate News Caches

As pointed out in section 3.4.3 a News Cache can be used in areas with an unreliable or slow Internet link (Figure 4.1(a)). In this case the News Cache reduces the transferred data to the absolute minimum. Only those articles requested by a news client are transferred from the feeding news server. Articles are retrieved from the news server only once, since articles that are requested repeatedly are already stored in the News Cache's news database. The reduction in transferred data can be calculated using the following formula, where $n_i$ is the number of accesses to article $i$ and $sz_i$ is the size of article $i$.

\[ \mathit{Gain} = \sum_i (n_i - 1) \cdot sz_i \quad [\text{Bytes}] \]
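As a worked instance of the gain formula, the sum can be evaluated directly over a list of articles. The access counts and sizes below are invented for illustration and are not measurements from the thesis.

```python
def bandwidth_gain(articles):
    """Bytes saved by the cache: sum of (n_i - 1) * sz_i over requested articles.

    articles is an iterable of (accesses, size_in_bytes) pairs.
    """
    return sum((n - 1) * size for n, size in articles if n > 0)

# Invented access counts and sizes, for illustration only.
sample = [(5, 2000), (1, 3000), (12, 1500)]
print(bandwidth_gain(sample))   # 4*2000 + 0*3000 + 11*1500 = 24500
```

An article that is read only once contributes nothing to the gain, since it still has to be transferred from the news server once.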

A News Cache can replace a leaf node news server since a News Cache requires less hardware and less network bandwidth. Figure 4.1(b) shows a scenario where a leaf node news server is replaced by a News Cache.

[Figure 4.1 diagram with four panels: (a) Slow link to news server, (b) Replacing a leaf node news server, (c) Server multiplexing, (d) Reducing the server load of the main news server. Legend: possible NNTP connection, possible NNRP connection, clients reading news, high speed link (e.g. LAN, ATM, ...), slow link (e.g. modem line).]

Figure 4.1: Where to locate a News Cache

The News Cache reduces the required network bandwidth between the news cache (the former leaf node news server) and its feeding news server. The news cache also reduces the hardware requirements, because it uses less disk space and less CPU time than a news server (the hardware requirements of a news server have been explained in detail in chapter 3). To use less disk space the News Cache has to expire older articles. If an expired article is requested again, it has to be fetched from the news server again.

The News Cache can be used to multiplex between different news servers. This can be necessary if a user needs to access newsgroups from two news servers (e.g., because one server holds the local newsgroups and the other the globally accessible groups) and his news reader is not able to distinguish between different news servers. In this situation the user connects his news reader to the News Cache instead of connecting it to a news server directly.

The News Cache can also reduce the load on heavily used news servers (Figure 4.1(d)). This alleviates the problem of catching up a backlog (see section 3.1 for an explanation of this problem). Instead of connecting directly to the news server, the user's news reader connects to one of the news caches. Users may connect to whichever cache they like, because the articles in each cache have the same article numbering as on the news server itself.

However, the cache server cannot replace non-leaf-node news servers. These are news servers with links to several other news servers. Non-leaf-node news servers collect data from many news servers and have to store all arriving articles for distribution. Since the News Cache does not include this functionality, it is impossible to replace these news servers by a News Cache.

4.3 News Server Multiplexing

The News Cache presented in this thesis shall allow an easier provision of local newsgroups as described in section 3.2. Hence it must be able to distinguish between different news servers. To do so, the News Cache has to maintain different news databases for different news servers and has to select the appropriate news database based on the newsgroup's name. Alternatively the News Cache may maintain only one news database. However, in this case the different news databases have to be merged. Since this can be a time-consuming task, we decided to maintain a separate news database for each news server.

4.4 News Database

Data cached by the News Cache have to be stored in a database. Hence the News Cache has to maintain a news database consisting of the newsgroup, article and overview databases, similar to the database of the news server.


4.4.1 Newsgroup and Article Database

Standard news servers like INN [Sal92] and CNews [CS] use the local file system to store their newsgroups and articles. Each newsgroup uses its own directory and each article is stored in its own file. To provide a unique file name for each article, the article number is used as the file name. This straightforward approach has several disadvantages:

• News articles are usually very small (about 1 to 4 Kilobytes). Disk space is allocated in blocks which are typically between 512 and 4096 Bytes. Thus disk space is wasted depending on the block size (approximately between 10% and 25%, as shown in Figure 4.2).
• The total number of files is limited in most file systems. Thus free disk space may be available, but cannot be allocated if the maximum number of files has already been reached.

[Figure 4.2 plot: bytes used (0 to 4096) against file size in bytes (0 to 4096). i(s): minimum space required for files; u(s): space allocated for files.]

\[ \mathit{Waste} = \int \bigl(u(s) - i(s)\bigr)\, ds \quad [\text{Bytes}] \]

Figure 4.2: File system space overhead

Hence, a new and better approach should be found for the News Cache and future news servers. An alternative approach is the use of a database for each newsgroup, where all the newsgroup's articles are stored. However, this requires the availability of a powerful database. Nevertheless we adopted this approach. On the one hand the News Cache does not store as many articles as the news server does, and on the other hand we did not have the resources and time to design a new article database.
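The block allocation overhead discussed above can be made concrete with a short calculation. This is a sketch under assumptions: the article sizes are invented, and the model counts only data blocks rounded up to whole blocks, ignoring inodes, directories and other file system metadata.

```python
def allocated(size, block_size):
    """Bytes occupied on disk when space is handed out in whole blocks."""
    blocks = -(-size // block_size)        # ceiling division
    return blocks * block_size

def waste_fraction(sizes, block_size):
    """Share of the allocated space that holds no article data."""
    used = sum(sizes)
    on_disk = sum(allocated(s, block_size) for s in sizes)
    return (on_disk - used) / on_disk

# Invented article sizes in bytes, within the 1 to 4 Kilobyte range.
sizes = [1100, 1800, 2500, 3300, 4000]
for block_size in (512, 1024, 4096):
    print(block_size, round(waste_fraction(sizes, block_size), 3))
```

The wasted share grows with the block size, because each file leaves on average about half a block unused regardless of how small the file is.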

4.4.2 Overview Database

The overview database provides a short summary of the articles available within a newsgroup. The news server can compute this data when a client requests it, since all articles are stored within the news server's database. The News Cache does not necessarily store all the articles of a newsgroup. Therefore the News Cache cannot compute this database on the fly from the article database. Hence, we have to use another database storing the overview information for each article. The design and implementation of this database will be presented in section 5.5.3 and section 6.5.3, respectively.

4.5 Caching Strategy

The caching strategy determines how the News Cache handles read and write requests. Read requests are usually fulfilled immediately. Delaying a read request is clearly undesirable, because the client has to poll for the data until they are available. In the case of the News Cache, the News Cache would have to inform its user that the requested data are currently not available but will be requested from the actual data source. This strategy is not satisfactory for the following reasons:

• The user has to select a newsgroup twice. The first selection is necessary for the cache to request the data of interest, but does not provide any information to the user.
• This strategy is difficult to implement, because no satisfying solution exists to inform the user that the articles of the newsgroup are not cached currently. This could be done by using a fake article informing the person reading news that the newsgroup's articles will be available in some minutes. However, this would confuse the news reader's database of read and unread articles.
• The cache server should be fast, and only small delays should occur for data not available in the news cache.

The News Cache presented in this thesis fulfills read requests immediately. Similar strategies may be used for post requests. Either they are fulfilled immediately like read requests, or they are spooled and fulfilled later on. It does not matter if post requests are spooled and fulfilled with a delay, because usually the reader does not need to access the article immediately after it has been posted. For postings the following strategies can be distinguished:

Post Through means that whenever an article is posted, it is posted directly to the news server and not stored within the News Cache. The advantage of this method is its simplicity. The News Cache does not need to maintain the postings in any way. However, if the connection to the news server is lost, the posting of the article will fail and it has to be reposted by the user.

Buffered Post Through eliminates the problem of the previous strategy, because all post requests are stored locally. As soon as a connection to the news server is possible, the article is posted. This gives the client the impression that the posting succeeded even if the request cannot be processed immediately. However, articles spooled for posting are not available from the cache until they have been posted to the news server, because post requests are stored in a queue independent of the newsgroups.

Buffered Post Back eliminates this small problem of the Buffered Post Through strategy at the cost of additional complexity. Posted articles that are so far stored only on the cache server can already be served to its clients, because they are merged into the local article spool. Whenever an article is posted, it is immediately accessible via the requested newsgroup and has to be assigned an article number. However, this means that the news cache has to use its own article numbers for all of its articles and has to maintain a mapping between the article numbering on the News Cache and that on the news server.

We think that the News Cache should use either Post Through combined with Buffered Post Through as a fallback strategy in case the news server cannot be reached, or Buffered Post Through alone. We discourage the use of the Buffered Post Back strategy. Since this method uses local article numbers depending on the arrival order of the articles, the user cannot switch between different News Caches attached to the same news server. The problem arises from the fact that most news readers need the article numbers to remember which articles have been read. Compared to the Post Through strategy, Buffered Post Back has the advantage that post requests can always be handled. If the news server is accessible they are sent to the news server directly. Otherwise, they are spooled until the news server is accessible.
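The recommended combination, Post Through with Buffered Post Through as a fallback, can be sketched as a small queue. The class name `PostingQueue`, the callable `post_to_server` and the use of `ConnectionError` to signal an unreachable news server are assumptions made for this illustration; the spool is kept in memory here, whereas a real cache would keep it on disk.

```python
from collections import deque

class PostingQueue:
    """Post Through, falling back to Buffered Post Through when offline."""

    def __init__(self, post_to_server):
        self.post_to_server = post_to_server   # hypothetical callable; raises ConnectionError
        self.spool = deque()                   # articles waiting for the news server

    def post(self, article):
        """Try to post directly; spool the article if the server is unreachable."""
        try:
            self.post_to_server(article)
            return "posted"
        except ConnectionError:
            self.spool.append(article)
            return "spooled"

    def flush(self):
        """Post spooled articles in arrival order; stop at the first failure."""
        while self.spool:
            try:
                self.post_to_server(self.spool[0])
            except ConnectionError:
                return False
            self.spool.popleft()
        return True
```

Spooled articles never receive a local article number in this sketch, which is what keeps the article numbering identical to that of the news server.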

4.6 Expiration

If the News Cache runs for a longer time, the cache's database grows and older articles must be eliminated to free disk space. The following strategies are available for the replacement of older articles.

Random expires news articles in a random manner. This method is rather unintelligent and should be avoided.

Least Recently Used expires first the articles that have not been accessed for the longest period.

Least Frequently Used expires first the articles that have been used least frequently, because those articles are least likely to be accessed again.

Oldest Article First expires the oldest article first. These articles are unlikely to be accessed again, because they have already been read by most news readers and they will expire first on the news server.

Biggest Article First expires the biggest articles first. This eliminates articles usually found in binary groups. The elimination of these articles costs the least time and frees the most disk space.

In the current version of the News Cache we decided to use the Oldest Article First strategy. Statistics for these expiration methods are required for a detailed discussion. This will be done in a future work (see section 8.1.3).
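A minimal sketch of the Oldest Article First strategy is shown below. It assumes that each cached article is described by a sortable date, its size and its message id, and that expiration is triggered by a disk space limit `max_bytes`; both the tuple layout and the trigger are assumptions for illustration, not the News Cache's actual data structures.

```python
def expire_oldest_first(articles, max_bytes):
    """Drop the oldest articles until the cache fits into max_bytes.

    articles is a list of (date, size_in_bytes, message_id) tuples, where
    date is any sortable timestamp. Returns (kept, expired).
    """
    kept = sorted(articles)                 # oldest first
    total = sum(size for _, size, _ in kept)
    expired = []
    while kept and total > max_bytes:
        victim = kept.pop(0)                # remove the oldest remaining article
        total -= victim[1]
        expired.append(victim)
    return kept, expired

kept, expired = expire_oldest_first(
    [(3, 100, "<c@example>"), (1, 200, "<a@example>"), (2, 300, "<b@example>")],
    max_bytes=350,
)
print("kept:", kept)
print("expired:", expired)
```

Switching to another strategy from the list above would only change the sort key, for example the last access time for Least Recently Used or the negated size for Biggest Article First.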

4.7 News Cache Interfaces

Special care has to be taken in the definition of the News Cache's interfaces to news readers and servers. Otherwise we would have to adapt existing programs in order to allow them to cooperate with the News Cache. Adapting all existing programs is obviously impossible and therefore unacceptable. Hence we have to use the protocols already used by the available news servers and news readers. Otherwise the acceptance of the News Cache would be too low.

Client Interface The news cache has to provide the same interface to its clients as a news server. This is the only way to ensure that existing news readers can use the News Cache without any modifications. The user of the news reader only needs to configure the news cache as its news server.

Server Interface Since we cannot define a new protocol for the News Cache, the News Cache has to use the currently available commands to retrieve articles from its news server. If we use the commands available to a news reader, we ensure that the news server need not be changed in any respect.

4.8 News Cache Cooperation

The simplest configuration uses a single cache server attached to a single news server, with several clients attached to the news cache. Other cooperation patterns among the caches are hierarchical news caches and a cooperating pool of caches. These organization types are called cascading and neighboring, respectively. Both patterns are already in use on the World-Wide Web. For a detailed discussion see [BDH+94]. We will discuss the usability and implications of these cooperation patterns in the following sections.


4.8.1 Cascading

Cascading means building a tree of news caches as shown in Figure 4.3. If a news cache at a certain layer does not hold the requested data, it forwards the request to its upstream news cache, and so on. Since the news cache acts as an NNRP server and uses NNRP commands on the server side, setting up a tree of news caches is easy.

[Figure 4.3 diagram: a News Server at the root of a tree of News Caches 1 to 6.]

Figure 4.3: Cascading News Caches

A tall tree usually brings only minor benefits, because the higher level caches may have to store so many articles that they might as well be replaced by ordinary news servers. Performance might also degrade, because a request for an article may have to be processed by every news cache on the path up to the root of the tree. Under these circumstances it is often better to set up a news server instead of one of the higher level caches and to attach the other caches to this news server.
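The forwarding behaviour in a cascade, and the cost of a tall tree, can be illustrated with a chain of nodes. The `Node` class, the node names and the hop counter are hypothetical and exist only for this sketch; in the real system each hop would be an NNRP connection.

```python
class Node:
    """A news cache, or the news server when upstream is None."""

    def __init__(self, name, upstream=None, articles=None):
        self.name = name
        self.upstream = upstream
        self.articles = dict(articles or {})   # message id -> article text

    def fetch(self, message_id):
        """Return (article, hops), where hops counts the nodes consulted."""
        if message_id in self.articles:
            return self.articles[message_id], 1
        if self.upstream is None:
            raise KeyError(message_id)          # the news server does not have it
        article, hops = self.upstream.fetch(message_id)
        self.articles[message_id] = article     # keep a copy for later requests
        return article, hops + 1

server = Node("news-server", articles={"<1@example>": "Hello"})
cache1 = Node("cache-1", upstream=server)
cache3 = Node("cache-3", upstream=cache1)
first = cache3.fetch("<1@example>")    # walks up to the root of the tree
second = cache3.fetch("<1@example>")   # answered from the local copy
print(first, second)
```

The hop count of the first request grows with the height of the tree, which is the performance concern raised in the text.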

4.8.2 Neighboring

Neighboring is a technique where the cache server checks whether the requested document can be found in other caches before asking the information server. If one of these neighboring caches stores the data, the data will be retrieved from that cache server. We think that it is not necessary to consider this structure for a News Cache currently. The infrastructure provided by the currently existing news servers is very dense and it is faster to query a news server than the neighboring caches.

4.9 Concurrent Access to the News Database

The news cache has to support simultaneous access from multiple clients. Hence it is necessary that multiple instances of the News Cache are running which access the News Cache's news database. This may be achieved either by using a separate process for each client or by using a threaded cache server. We will discuss the pros and cons of these alternatives in more detail in section 6.1. However, both solutions have to support concurrent access to the news database. Since concurrent access to the news cache database has to be provided, mutual exclusion has to be guaranteed for update requests. Whenever an instance of the News Cache needs to update the news database, the news database has to be exclusively locked for the update operation. We will discuss this problem in detail in section 6.5.
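The requirement of exclusive locking during updates can be sketched for the threaded variant as follows. This is only an illustration: `NewsDatabase` is a toy in-memory structure invented for this sketch, and with one process per client a lock shared between processes (for example a file lock) would have to take the place of the thread lock.

```python
import threading

class NewsDatabase:
    """Toy in-memory news database guarded by a single exclusive lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._articles = {}      # newsgroup -> list of articles

    def store(self, group, article):
        """Update path: the database is locked for the whole update."""
        with self._lock:
            self._articles.setdefault(group, []).append(article)

    def count(self, group):
        with self._lock:
            return len(self._articles.get(group, []))

def worker(db, group, n):
    # Each simulated client instance stores n articles.
    for _ in range(n):
        db.store(group, "article")

db = NewsDatabase()
threads = [threading.Thread(target=worker, args=(db, "local.test", 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db.count("local.test"))   # 8000
```

A single lock for the whole database is the simplest choice; finer-grained locks per newsgroup would allow more parallelism at the cost of more bookkeeping.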

Chapter 5 Design

The notation we will use to describe object hierarchies is taken from [Boo94]. Figure 5.1 gives a short overview of the symbols. A detailed description of the notation is given in appendix A.

[Figure 5.1 legend. Class diagram: ClassName, Attributes, Operations(). Class relationships: Association, Inheritance, Has, Using. Properties: A = Abstract Class, F = Friend, S = Static Member, V = Virtual.]

Figure 5.1: Notation used for the object-oriented design

5.1 Choice of Programming Language

We cannot choose a programming language based on the News Cache's design alone, because the programming language also has to fulfill the following criteria:

Availability The programming language must be available on a wide set of

platforms since the News Cache should be available for as many platforms as possible. In addition free implementations of the chosen language should exist, since we want to make this software freely available under the terms of the Gnu Public License (). Acceptance The programming language has to be widely accepted since the user installing the News Cache should be able to compile the source code (binaries cannot be made available for all systems) and be able to apply minor code xes in case of problems. 28

CHAPTER 5. DESIGN

29

Optimization Compilers producing highly optimized code should be available.

Interpreted languages usually perform worse because of the interpreter's overhead. Hence we will not consider languages for which only interpreters are available (such as Perl [WCS96], Python ([vR97] and [Lut96]), Tcl [Ous94] and others). The following languages fulfill these requirements.

C The GNU C Compiler, a freely available and well-optimizing C compiler, as well as commercial C compilers are available for nearly any platform. C has the advantage that it is widely accepted and that most NNTP software (e.g., INN [Sal92], tin, etc.) is written in C. One disadvantage of C is that it is neither object-based nor object-oriented. This makes it difficult to write objects with similar behavior without unnecessarily duplicating source code. To test the suitability of C we wrote a small news cache with only very basic functionality. Even this small program became rather complex. We had to decide whether to implement lists for several types or to use only one list structure and make heavy use of the cast operator. Both methods are error-prone.

C++ The GNU C++ Compiler, a freely available, optimizing C++ compiler, as well as commercial C++ compilers are available for most platforms. C++ has the advantage that it is object-oriented and allows algorithms and objects to be defined in a type-independent way using templates. Unfortunately, C++ is not as widely available and accepted as C. Since C++ has been completely standardized only recently [ES90], many C++ compilers do not implement all aspects of the C++ standard. To test the suitability of C++, we ported our test program from C to C++. As expected, the code to handle different types of lists became smaller and easier to understand.

Java Java is a relatively new programming language that has been introduced by Sun Microsystems ([GJS97] and [AG97]). Java is similar to C++, except that pointers have been eliminated and explicit memory allocation has been replaced by an implicit one as found in Lisp or Prolog. Its advantage is that compilers exist both for an abstract machine and for native code. This means that the News Cache can be run without compiling the source code on any computer where an interpreter for the Java Virtual Machine is available, although at a considerable cost in performance.

For our implementation we have chosen C++ over Java, since C++ compilers are more widely available and offer much better performance and acceptance.


Other languages like Objective C, Smalltalk, Pascal, Modula 2, and similar languages have not been considered because of their restricted availability and acceptance.

5.2 News Server Architecture

Before we present the object-oriented design of the News Cache and its associated classes, we explain the overall architecture of a news server. This should make the News Cache's architecture easier to understand, since the News Cache is based on a similar structure itself.

[Figure 5.2(a): Structure of a news server as EER diagram. Labels: Newsgroup (Id, Name, Flags), Article (Id, Head, Body), and NewsGroupId, ArticleNumber, ArticleId.]

[Figure 5.2(b): Object-oriented structure of a news server. Labels: News Clients, NewsServer, News Database, ActiveDB, Newsgroup, OverviewDB, Article.]

Figure 5.2: General structure of a news server

Figure 5.2(a) shows the relationship between newsgroups and articles as an EER diagram and suggests the use of a relational database as a simple approach. Newsgroups and articles are stored in the Newsgroup and the Article table, respectively. The relation between newsgroups and articles is stored in another table.


The database has to be able to maintain about 35000 newsgroups, 2.5 million articles and a table mapping the m:n relationship between newsgroups and articles with approximately 3.5 million entries. (The values given here have been taken from the news server at "Technische Universität Wien".) Since not everybody who wants to run a News Cache should be forced to buy such a powerful database, we have decided to design and implement our own database for the management of the newsgroups and articles. This database will partially be based on the filesystem. The design of this database will be shown in section 5.5 and its implementation in section 6.5.

Figure 5.2(b) shows the architecture of a news server using an object-oriented model. Connections to the news clients are handled by the NewsServer object, while the news database is handled by other objects. The ActiveDB object is used to maintain the list of available newsgroups (Active Database) and the Newsgroup object is used for the maintenance of the currently selected newsgroup. The Newsgroup object uses an OverviewDB object to manage a summary of available articles and an Article object for the maintenance of the currently selected article. The relationships are indicated by a "has a" association arc. The user of another object is marked with a filled circle. In the following we describe the objects shown in Figure 5.2(b).

NewsServer This object stores the current state of the news server and provides all functions necessary to retrieve articles and newsgroups. The state includes the currently selected newsgroup and the number of the currently selected article. The functions include retrieving the list of newsgroups, the overview database and the articles stored on the news server, and others.

Newsgroup This object stores the articles and the overview database of the current newsgroup. It provides methods to access this information.

Article Stores a Usenet News article.

ActiveDB This object stores the newsgroups available on a news server. For each newsgroup the type of the group (public, moderated, read-only) and the numbers of the first and last articles are stored. It provides methods to access this information.

OverviewDB Stores a summary of articles available within a newsgroup. It provides methods to access this information.

Figure 5.3 shows the behavior of a news server as a finite state diagram. In the news server's initial state no newsgroup is selected. Once a newsgroup is selected by the client, the subsequent listgroup, xover and article commands will work on the newly selected newsgroup.

[Figure 5.3: State transition diagram for a news server. States: Initial State (commands: List Active, List Newgroups, Group) and Group Selected (commands: List Active, List Newgroups, Group, Article, XOver). The "select newsgroup" transition leads from either state to Group Selected.]
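The state diagram can be sketched in a few lines of C++. This is only an illustration under assumptions: the names SessionState, Command and Session are hypothetical and do not come from the News Cache sources, and the command set is limited to the commands named in Figure 5.3 and the surrounding text.

```cpp
#include <cassert>
#include <string>

// Hypothetical states mirroring Figure 5.3.
enum class SessionState { Initial, GroupSelected };

// Hypothetical command identifiers (not actual protocol strings).
enum class Command { ListActive, ListNewgroups, Group, ListGroup, XOver, Article };

// Hypothetical per-client session; the thesis keeps this state in NewsServer.
struct Session {
    SessionState state = SessionState::Initial;
    std::string group;  // name of the currently selected newsgroup

    // Returns true if the command is permitted in the current state.
    bool accepts(Command cmd) const {
        switch (cmd) {
            case Command::ListActive:
            case Command::ListNewgroups:
            case Command::Group:
                return true;  // available in both states
            case Command::ListGroup:
            case Command::XOver:
            case Command::Article:
                return state == SessionState::GroupSelected;
        }
        return false;
    }

    // The "select newsgroup" transition, valid from either state.
    void select_group(const std::string& name) {
        assert(!name.empty());
        group = name;
        state = SessionState::GroupSelected;
    }
};
```

A freshly created Session rejects Article and XOver until select_group() has been called once; selecting another group afterwards simply replaces the current one.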

5.3 News Cache Architecture

The News Cache consists of several modules as shown in Figure 5.4. The Cache Daemon module handles requests from clients and passes these requests to the News Cache's controlling module (the CacheServer module). The CacheServer module checks whether the requested data are available in the local database, which is handled by the LocalServer module. Otherwise the requested data will be retrieved from the news server using the RemoteServer module.

[Figure 5.4: Architecture of the News Cache. Labels: News Server, NNRP, RemoteServer, CacheServer, LocalServer, Cache Daemon, NNRP, News Clients.]

The Cache Daemon module reads NNRP requests from its clients and transforms each request into appropriate calls to the CacheServer module. The result returned will be passed back to the clients. The CacheServer module is responsible for the cache management. Whenever a request is made to this module, it checks whether the LocalServer module can fulfill the request. The LocalServer can fulfill the request if the requested data are already cached and have not timed out. As we have explained in section 4.1, data requested by the News Cache remain valid for a predefined period of time. If the request cannot be fulfilled by the LocalServer module, the CacheServer module uses the RemoteServer module to retrieve the data. The fetched data are stored in the LocalServer's news database and passed back to the requester.

The LocalServer module gives access to a local news database and the RemoteServer module allows the retrieval of newsgroups and articles from a news server. The architecture of the CacheServer, LocalServer and RemoteServer modules is based on the architecture of a news server as explained in section 5.2, because they provide access to the news database in the same way as a news server does. Each module is implemented as a separate class. For historical reasons these classes are called CServer, LServer, and RServer. Section 5.4 explains these classes in detail.
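The request flow between the three modules can be illustrated with a small sketch. It is not the News Cache's actual code: ArticleSource, LocalStore, FakeRemote and Cache are hypothetical stand-ins for the shared interface and the LocalServer, RemoteServer and CacheServer modules, articles are reduced to strings, and the timeout check is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical minimal interface shared by all three modules.
struct ArticleSource {
    virtual ~ArticleSource() = default;
    // Stores the article in `out` and returns true if it is available.
    virtual bool article(int nbr, std::string& out) = 0;
};

// Stand-in for the LocalServer module: an in-memory store.
struct LocalStore : ArticleSource {
    std::map<int, std::string> articles;
    bool article(int nbr, std::string& out) override {
        std::map<int, std::string>::const_iterator it = articles.find(nbr);
        if (it == articles.end()) return false;
        out = it->second;
        return true;
    }
    void store(int nbr, const std::string& text) {
        assert(nbr > 0);  // article numbers are assumed to be positive
        articles[nbr] = text;
    }
};

// Stand-in for the RemoteServer module: counts how often it is asked.
struct FakeRemote : ArticleSource {
    int requests = 0;
    bool article(int nbr, std::string& out) override {
        ++requests;
        out = "article " + std::to_string(nbr);
        return true;
    }
};

// Stand-in for the CacheServer module: local first, remote on a miss.
struct Cache : ArticleSource {
    LocalStore& local;
    ArticleSource& remote;
    Cache(LocalStore& l, ArticleSource& r) : local(l), remote(r) {}
    bool article(int nbr, std::string& out) override {
        if (local.article(nbr, out)) return true;     // served from the cache
        if (!remote.article(nbr, out)) return false;  // not available anywhere
        local.store(nbr, out);                        // remember for later requests
        return true;
    }
};
```

Requesting the same article twice through Cache reaches the remote side only once; the second request is answered from LocalStore.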

5.4 News Server Class Library

As explained in section 5.3, the architecture of the News Cache is based on several news server objects providing an interface to different kinds of news databases. Each of these objects will be implemented using its own class, and all these classes together will form the News Server Class Library. Figure 5.5 shows the hierarchy of the classes of the News Server Class Library. The base of all news server classes is an abstract class called NServer. This class defines a general interface to a news database that has to be implemented by all of its subclasses.

5.4.1 News Server Class (NServer)

The NServer class does not implement the access to any news database. It defines only the interface for the methods necessary to access a news database. The methods are defined as virtual to allow the exploitation of polymorphism across the News Server classes. This makes it possible to write functions and methods that operate on any news server class. Depending on the exact class given to those functions, the appropriate class method will be called.

5.4.2 Remote Server Class (RServer)

The RServer class uses a news server as its news database. Newsgroups and articles will be retrieved from the news server using NNRP. Hence this class may be used in programs that need to access a news server (e.g., a news reader, proxy news server, etc.). Figure 5.6 shows this configuration.

[Figure 5.5: Hierarchy of the News Server classes. The abstract class NServer declares the virtual methods active(), group(), xoverdb(), article() and post(); the classes RServer, LServer and CServer are shown below it.]

[Figure 5.6: News reader using the RServer class. Labels: News Server, News Database, NNRP, RServer class, News Reader.]

5.4.3 Local Server Class (LServer)

The LServer class uses a local news database. This class may be used in programs that need to access a local news database (e.g., a news server, etc.). Figure 5.7 shows this configuration.

[Figure 5.7: News server using the LServer class to access the local news database. Labels: LServer class, News Database, News Server.]

5.4.4 Cache Server Class (CServer)

The CServer class retrieves newsgroups and articles from a news server and caches these data in its own local news database. It is built on top of the RServer class, which is used to retrieve the data from the news server, and on top of the LServer class, which is used for the local news database. Using this class it is possible to implement a news reader with a local cache or even a cache server for the News System. Figure 5.8 shows an example.

[Figure 5.8: News Cache using the CServer class for its database. Labels: News Server, News Database, NNRP, CServer class, News Cache, News Database.]

5.5 News Database

The Non Volatile Container Class Library and the local file system will be used to store the news database. The Non Volatile Container Class Library has been implemented because no suitable library currently exists that stores its data structures on non volatile memory. Special care has been taken to design this library as generally as possible. The Non Volatile Container Class Library is used to store the active database and the overview database. To store news articles the local filesystem is used.

5.5.1 Non Volatile Container Class Library

The Non Volatile Container Class Library (NVContainer Class Library) provides a set of containers that store their data structures on non volatile (external) memory (e.g., on the hard disk). Since the NVContainer classes store their data on external memory, they reduce the virtual memory used by a process using this library. In addition, this approach allows the database provided by an NVContainer class to be used by several processes concurrently. Hence an NVContainer class can also be used to provide "shared memory". The use of a memory-mapped file has the following advantages and disadvantages:

+ The container's contents need not be written to or read from the file explicitly.
+ The process does not consume virtual memory for the data mapped from the file. Figure 5.9 shows this on a SunOS system for a process having allocated 5 Megabytes in an NVList. The SZ column indicates the total amount of memory used for the data and stack segments in Kilobytes and the RSS column indicates the total amount of main memory used by the process.
+ If a list is to be shared by several processes, they use the same file for their memory, so the file's contents are shared by all of them.
- If the file size changes, the memory region has to be unmapped and mapped again, since the size of the mapped region must be passed to the mmap system call. This might decrease the performance slightly.
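The mapping and remapping steps from the list above can be sketched with the POSIX calls open, ftruncate, mmap and munmap. This is a minimal sketch under assumptions, not the library's implementation: the helper names map_file and remap_file are hypothetical, and error handling is reduced to returning a null pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: resizes `path` to exactly `size` bytes and maps it
// read/write and shared, so that other processes mapping the same file see
// the same contents. Returns nullptr on failure.
inline void* map_file(const char* path, std::size_t size) {
    assert(size > 0);  // mmap rejects zero-length mappings
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return mem == MAP_FAILED ? nullptr : mem;
}

// Hypothetical helper: changes the size of a mapping by unmapping the old
// region and mapping the file again with the new size. This is the extra
// step named as a disadvantage in the list above.
inline void* remap_file(const char* path, void* old_mem,
                        std::size_t old_size, std::size_t new_size) {
    if (munmap(old_mem, old_size) != 0) return nullptr;
    return map_file(path, new_size);
}
```

Data written through the first mapping is still present after remap_file(), because the bytes live in the file and not in the process's private memory.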

USER      PID  %CPU  %MEM  SZ  RSS  TT  STAT  START  TIME  COMMAND
gschwind  461  0.0   1.1   48  296  p7  S     00:02  0:00  testNVList

Figure 5.9: The virtual memory usage of a process with an NVList of 5 Megabytes

The NVContainer Class Library consists of an abstract parent class (NVcontainer) providing the basic functionality necessary in all its subclasses. The NVcontainer class manages the allocation and deallocation of the non volatile memory and provides functions for locking and unlocking the external memory. The latter is necessary to ensure mutual exclusion between different processes that access the database provided by an NVContainer class simultaneously. (Currently, the NVContainer Class Library is restricted to non volatile memory that can be memory mapped and locked by the operating system.)

The current version of the NVContainer Class Library provides a Non Volatile List (NVList) and a Non Volatile Hash (NVHash) class. Both are subclasses of the abstract class NVlist. NVlist is a direct subclass of NVcontainer and implements basic list operations. In future versions of the NVContainer Class Library we will also provide classes storing their data using trees, arrays and other data structures. Each container class comes with an iterator that gives access to the data stored within the class. Figure 5.10 shows the hierarchy of the classes provided by the NVContainer Class Library. Classes planned for future releases (e.g., NVtree) are shown in gray.

[Figure 5.10: Class hierarchy of the Non Volatile Container classes. NVcontainer (MemoryManagement(), Locking()) is the abstract root. NVlist (AddElement(), RemoveElement()) and the planned NVtree (AddElement(), RemoveElement()) are abstract subclasses. NVList (AddElement(), RemoveElement()) and NVHash (AddElement(), Clear()) are subclasses of NVlist.]

The NVList Class and its Iterator

This class uses a list to store its data. Whenever an NVList class is instantiated, it creates a new file to store its data or, if the file already exists, uses the contents stored in this file. The NVListIter class is used to iterate over an NVList class.

The NVHash Class and its Iterator

This class uses a hash table to store its data. Whenever an NVHash class is instantiated, a new file to hold the class's data is created. If the file already exists, the contents of this file will be used for the hash table. The NVHashIter class is used to iterate over an NVHash class.

5.5.2 Active Database

The Active Database consists of several classes. Its classes and their relationship are shown in Figure 5.11. The abstract base class ActiveDB defines only the interface to the active database. This is done using virtual methods to allow the exploitation of polymorphism. VActiveDB and NVActiveDB are subclasses of ActiveDB.

[Figure 5.11: Class hierarchy for the active database. The abstract class ActiveDB (GroupManipulations()) has the subclasses VActiveDB and NVActiveDB; NVActiveDB also inherits from NVHash.]

VActiveDB implements an active database stored on volatile memory. It is used for the RServer class, because this class does not need to store its databases on external memory. NVActiveDB implements an active database stored on non volatile memory. To offer this, NVActiveDB inherits the functionality provided by the NVHash class. The class is used for the news server classes that need to maintain a local news database. These are the LServer and CServer classes.

5.5.3 Overview Database

The overview database holds a short summary for each article. The summary consists of the date, the subject, the author, the article id, the number of lines of the article, and other data depending on the news server. The overview database has been introduced to allow news readers faster access to these data. Without this extension the news reader would have to retrieve the article header for each article, which increases the required network bandwidth and the time necessary to prepare the listing of available articles. The overview database uses the same design as the active database. It consists of an abstract parent class (OverviewDB) defining the interface and the subclasses VOverviewDB and NVOverviewDB.

[Figure 5.12: Class hierarchy for the overview database. The abstract class OverviewDB (Read/Write(), Expire()) has the subclasses VOverviewDB and NVOverviewDB; the diagram also shows NVList as a parent of NVOverviewDB.]

5.5.4 Article Database

The article database is stored in the same format used by INN [Sal92] and CNews [CS]. Each newsgroup has its own directory named after the group name. The articles of the newsgroup are stored in its directory and each article is stored in its own file. For example, the article 567 in newsgroup linux.dev.kernel will be stored in linux/dev/kernel/a567. We adopted this scheme for the News Cache for two reasons. On the one hand, we did not want to use a relational database to store articles, as explained in section 4.4.1. On the other hand, the News Cache does not store as many articles as a news server. Nevertheless, the format of the article database may change in future versions of the News Cache.
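The mapping from a newsgroup name and an article number to a file name follows directly from the example above. The function below is a hypothetical helper written for illustration; its name and signature are assumptions and are not taken from the News Cache sources.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the relative path of an article file by
// turning every dot of the group name into a directory separator and
// appending "a" followed by the article number, as in the example above.
inline std::string article_path(const std::string& group, int number) {
    assert(!group.empty() && number > 0);
    std::string path = group;
    for (char& c : path) {
        if (c == '.') c = '/';  // each name component becomes a directory
    }
    return path + "/a" + std::to_string(number);
}
```

For the example from the text, article_path("linux.dev.kernel", 567) yields "linux/dev/kernel/a567".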

Chapter 6 Implementation

This chapter explains the implementation decisions for the News Cache. We will show the multiprocessing environment and the programming language that we have chosen and explain the News Cache's most important classes and their interaction.

6.1 Multiprocessing Environment

Simultaneous client accesses to the News Cache must be possible. Hence, we must spawn a new process or at least a new thread for each client. Since all these processes (or threads) access common resources such as the newsgroup database, mutual exclusion has to be assured.

6.1.1 Processes

Processes are available on all multiprocessing platforms in a similar manner. However, processes have the disadvantage that a task switch between two processes is more expensive than a context switch between two threads. Processes also do not share their data segment. Hence data that could be shared in theory is allocated separately in each process, and data structures are not shared between processes by default.

One possibility to provide shared memory between different processes is to request a chunk of shared memory explicitly. The shared memory should be requested in big chunks, since the number of shared memory segments is limited on many operating systems. In addition, a large number of smaller chunks reduces the performance of the whole system.

On most modern operating systems such as Unix or Windows NT, a system call is available to map the contents of a file into a process's memory space. Many different processes may map the contents of the same file into their memory space. In this case, modifications to the file are shared by all processes. This method is similar to requesting shared memory, but has several advantages. Most importantly, all data are implicitly stored on non volatile memory and need not be written to a file explicitly. In addition, it reduces the process's memory requirement.

In both situations mechanisms have to be provided to ensure mutual exclusion. Usually, semaphores are used to ensure mutual exclusion on shared memory, and file locking is used for the file-mapping approach. We think the best solution for the News Cache is the use of a memory-mapped file, because the data must be stored in a file anyway.

6.1.2 Threads

Threads have the advantage that a context switch between threads is cheaper than a task switch and that they can share their variables without explicit allocation of shared memory. Threads allow a better exploitation of parallelism, because communication is cheaper in terms of the resources needed than inter-process communication, and data sharing can be performed on a finer scale. However, special care has to be taken in exploiting this extra parallelism, otherwise the performance of the whole system can decrease. Unfortunately, threads are not supported by all operating systems. Although a portable POSIX-compliant [Pro97] thread library exists for many operating systems, this library is installed on only very few systems. Moreover, the libraries that come with the operating systems differ, which complicates a system-independent implementation.

6.2 Programming Language

We have decided to use C++, because it allows object-oriented programming. This is useful, because the News Cache uses many similar structures. C++ makes it possible to write those structures with minimal effort using classes and inheritance. Additionally, C++ allows the exploitation of polymorphism across such classes. Another reason has been that optimizing and widely accepted compilers are available for C++. Section 5.1 gives additional details.

6.3 News Cache Daemon

As explained in section 5.3, the News Cache consists of several modules (Figure 6.1). The Cache Daemon module interprets commands received from news clients. The heart of this module is the dispatching function (nnrpd) that decides which function is responsible for which command. This decision is based on a jump table stored in the nnrp_commands variable. Since C++ does not support calling methods using jump tables (this is due to the implicit this parameter), we did not implement this module using its own class.

[Figure 6.1: The News Cache Daemon. Labels: News Server, NNRP, ActiveDB, CacheServer, OverviewDB, Cache Daemon, NNRP, News Clients.]

The Cache Daemon module is only responsible for the NNRP protocol and the multiplexing between different news servers. The News Cache uses the CacheServer module to maintain its news database and to retrieve newsgroups and articles from the news server (Figure 6.1). The news database itself is stored using the active, overview and article database.

The multiplexing between different news servers is performed using the MPList class. Whenever a newsgroup is selected by the client, the News Cache uses an MPList to find the appropriate news server. The appropriate news server is selected using the CServer's setserver method. Since we have decided to use one process for each client, the News Cache can either be started from inetd or run as a standalone server.
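A jump table of the kind described for the dispatching function can be sketched with plain function pointers. The sketch rests on assumptions: the handler signature, the handler names and the three table entries are hypothetical, and the handlers only return the NNTP status code that a real handler would send (211, 220, 205, or 500 for an unknown command).

```cpp
#include <cassert>
#include <string>

// Hypothetical handler signature: receives the command's argument string
// and returns an NNTP status code.
typedef int (*CommandHandler)(const std::string& args);

// Hypothetical handlers; real ones would talk to the CacheServer module.
static int cmd_group(const std::string&)   { return 211; }  // group selected
static int cmd_article(const std::string&) { return 220; }  // article follows
static int cmd_quit(const std::string&)    { return 205; }  // closing connection

// One row of the jump table: command name and the function serving it.
struct CommandEntry {
    const char* name;
    CommandHandler handler;
};

// The jump table itself; the entries are illustrative only.
static const CommandEntry nnrp_commands[] = {
    {"group", cmd_group},
    {"article", cmd_article},
    {"quit", cmd_quit},
};

// Looks up `name` in the table and calls the matching handler.
// Returns 500 (command not recognized) if there is no entry.
inline int dispatch(const std::string& name, const std::string& args) {
    for (const CommandEntry& e : nnrp_commands) {
        assert(e.handler != nullptr);
        if (name == e.name) return e.handler(args);
    }
    return 500;
}
```

Because the table holds ordinary functions, no object is needed to call an entry, which matches the decision not to wrap this module in a class.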

6.4 News Server Class Library

The following sections show the implementation of the news server classes. These classes provide an interface to several kinds of news databases. The NServer class is an abstract class that acts only as a common superclass for the news server classes. To facilitate the exploitation of polymorphism, the methods of the NServer class are declared as abstract and virtual. Table 6.1 gives an explanation of these concepts.

The current version of the News Server Class Library already provides a class for the retrieval of news articles from a remote news server (RServer), a class for the retrieval from a local news database (LServer), and a class retrieving news articles from a remote news server with the ability to cache already requested data in a local news database (CServer). The relationship between those classes has been explained in section 5.4 and is shown again in Figure 6.2. The classes of this library allow any program (e.g., news readers, proxy news servers, news caches, etc.) to access a news database in different ways.

Abstract Method An abstract method only defines the interface for the method. The actual implementation of the method must be provided by classes that inherit the method from its parent class. Using abstract methods in a class ensures that all subclasses of the base class implement at least the abstract methods provided by the base class.

Virtual Method Declaring methods of a class as virtual allows polymorphism to be exploited across its subclasses. Consider a class GeometricShape and some subclasses (e.g., Rectangle, Circle, ...) that define the method draw. Whenever a GeometricShape is used in the place of one of its subclasses, the correct method draw will be called.

Table 6.1: Abstract and virtual methods in C++
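The GeometricShape example from Table 6.1 can be written out as follows. The class and method names come from the table; the return type of draw and the helper function render are assumptions made to keep the sketch short and checkable.

```cpp
#include <cassert>
#include <string>

// Abstract base class: draw() is both virtual and abstract ("= 0"),
// so every concrete subclass has to provide an implementation.
class GeometricShape {
public:
    virtual ~GeometricShape() = default;
    virtual std::string draw() const = 0;
};

class Rectangle : public GeometricShape {
public:
    std::string draw() const override { return "rectangle"; }
};

class Circle : public GeometricShape {
public:
    std::string draw() const override { return "circle"; }
};

// Hypothetical helper that only knows the base class. The call is
// resolved at run time to the draw() of the object's actual class.
inline std::string render(const GeometricShape& shape) {
    return shape.draw();
}
```

The news server classes rely on the same mechanism: code written against NServer can work with an RServer, LServer or CServer object without knowing which one it was given.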

6.4.1 News Server Class (NServer)

The NServer class defines the interface for all methods needed to access a news server. These methods are declared abstract and virtual. They are declared abstract because each method has to be implemented by the subclasses of this class. They are declared virtual to provide polymorphism across the news server classes. An NVActiveDB class is used for the active database and an NVOverviewDB class for the overview database. Currently these two classes have to be used by all of NServer's subclasses. However, this is likely to change in the future (see section 8.1.1 for a detailed explanation). The following methods have been defined for the retrieval of newsgroups, articles and the overview database:

active() This method returns a pointer to the active database.

group(char *name) This method returns a pointer to information about the newsgroup name. It contains the type of the newsgroup, the numbers of the first and last article and the total number of articles.

xoverfmt() This method returns the order of the database entries stored in the overview database.

xoverdb(int fst, int lst) This method returns a pointer to the overview database of the newsgroup. The overview database contains at least the overview records for the articles within the range fst-lst.

article(int nbr) returns the article with the number nbr from the current newsgroup.

post(Article *article) posts the article stored in article to the news database.

setserver(char *ns_name, char *ns_serv) This method selects a new database. The parameters ns_name and ns_serv define the name of the database. For the RServer and CServer classes ns_name and ns_serv reflect the name and the service of the news server. This function has been added to allow multiplexing between different news databases. This interface may change in the future.

[Figure 6.2: Hierarchy of the News Server classes. The abstract class NServer declares the virtual methods active(), group(), xoverdb(), article() and post(); the classes RServer, LServer and CServer are shown below it.]

6.4.2 Remote Server Class (RServer)

The RServer class is a subclass of the NServer class and implements the news server interface as specified by the NServer class. The RServer class uses the news database provided by a news server. The data on the news server are accessed using NNRP. The RServer class implements the following methods in addition to the methods specified in section 6.4.1:

connect() This method connects to the news server specified by the setserver method.

is_connected() This method returns whether the RServer class is currently connected to the news server.

disconnect() This method disconnects the RServer class from the news server.

6.4.3 Local Server Class (LServer)

The LServer class is a subclass of the NServer class and implements the news server interface as specified by the NServer class. The LServer class uses a local news database to provide newsgroups and news articles. The LServer class need not implement any method in addition to the ones defined in the NServer class.

6.4.4 Cache Server Class (CServer)

The CServer class is a subclass of the RServer and the LServer class and implements the news server interface as specified by the NServer class. The CServer class is used to retrieve newsgroups and articles from a news server using NNRP. Before these groups and articles are passed back to the caller, the data are cached in a local news database. The local news database is used to satisfy subsequent requests for the same data. In addition to the methods defined in its superclasses, the CServer class implements the setttl(ttl_list, ttl_desc, ttl_group) method. This method is used to specify the timeouts used for the ActiveDB (ttl_list), the group description file (ttl_desc) and the newsgroups (ttl_group). Whenever a client requests a database that has expired, the database is requested from the news server again.
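The expiry decision behind setttl can be illustrated with a short sketch. It is an assumption-laden illustration: the struct Timeouts and the function is_expired are hypothetical, and the thesis does not state here in which unit the timeouts are given, so seconds are assumed.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical holder for the three timeouts named in the text.
// All values are assumed to be in seconds.
struct Timeouts {
    std::time_t ttl_list = 0;   // active database
    std::time_t ttl_desc = 0;   // group description file
    std::time_t ttl_group = 0;  // newsgroups

    void setttl(std::time_t list, std::time_t desc, std::time_t group) {
        assert(list >= 0 && desc >= 0 && group >= 0);
        ttl_list = list;
        ttl_desc = desc;
        ttl_group = group;
    }
};

// Returns true if data fetched at time `fetched` is older than `ttl`
// at time `now` and therefore has to be requested from the news server again.
inline bool is_expired(std::time_t fetched, std::time_t ttl, std::time_t now) {
    return now - fetched > ttl;
}
```

With ttl_group set to 300, a newsgroup fetched at time 1000 is still served from the cache at time 1200 and is fetched again at time 1301.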

6.5 News Database

The news database consists of the active database (ActiveDB class), the overview database (OverviewDB class) and the article database (Article class). Figure 6.3 shows the relation between the database classes and the news server class (NServer). The figure differs slightly from the one presented in section 5.2, because the functionality of the Newsgroup class has been integrated into the NServer class. However, this might change in a future release.

[Figure 6.3: Relation between news server class and the news database. Labels: Article, ActiveDB, OverviewDB, NServer.]

In section 5.5, we have explained that the news database exists in two forms. One type uses non volatile memory and the other one uses volatile memory to store the database. In the current version, we have only implemented the database for non volatile memory. This results in slightly slower performance in cases where the volatile database would be sufficient. The non volatile classes use the Non Volatile Container Class Library to store their data.

6.5.1 Non Volatile Container Class Library

The Non Volatile Container Class Library is the basic building block for the classes storing the news database. It implicitly stores data structures on external memory (usually a file in the local file system). The external memory is mapped into the process's memory space and all data items will be stored there.

NVcontainer Class

The NVcontainer class provides the basic methods for the classes of the Non Volatile Container Class Library. Each file starts with the header explained in table 6.2 and shown in Figure 6.4. This header is used to identify a file generated by an NVContainer class and to check whether the size of the file has been changed by another process. In the latter case, the whole file is remapped with the new size.

[Figure 6.4: Format of an NVContainer's header. Fields in order: hlen, version, size, mtime, freelist, bytes_free, userdata.]

The NVcontainer class provides methods to lock and unlock the database.

Header-Entry  Description

uint32 hlen  Length of this header in bytes. This entry is used to identify an NVContainer.

uint32 version  Version of the NVContainer Library. This entry is used to identify the version of the NVContainer that created this database.

uint64 size  Size of the database. This entry matches the size of the file and is used to check whether another process resized the container.

uint64 mtime  Time of last "logical" modification. This time has to be set by the user of the NVContainer. Hence the timestamp does not necessarily report all modifications to the database.

uint64 freelist  Pointer to the list of free memory blocks. Free memory blocks are currently managed by a linked list.

uint64 bytes_free  Number of free bytes in the NVContainer. This entry can be used to calculate the fragmentation of the free memory.

uint64 userdata  Pointer to the data provided by the NVContainer's subclasses (e.g., a pointer to the head of the linked list, a pointer to the hash table, etc.).

Table 6.2: The header of an NVContainer
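The header from Table 6.2 maps naturally onto a C++ struct with fixed-width integers. The field names and widths are taken from the table; the struct name NVHeader and the two helper functions are hypothetical, and the way hlen is checked is only one plausible reading of "used to identify an NVContainer".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Header layout following Table 6.2 (field names and widths from the table).
struct NVHeader {
    std::uint32_t hlen;        // length of this header in bytes
    std::uint32_t version;     // version of the library that wrote the file
    std::uint64_t size;        // size of the database, equals the file size
    std::uint64_t mtime;       // time of the last "logical" modification
    std::uint64_t freelist;    // file offset of the first free block
    std::uint64_t bytes_free;  // number of free bytes in the container
    std::uint64_t userdata;    // file offset of the subclass's data
};

// Two 32-bit and five 64-bit fields add up to 48 bytes without padding.
static_assert(sizeof(NVHeader) == 48, "unexpected padding in NVHeader");

// Hypothetical check: treats a matching header length as the signature
// of a container file.
inline bool looks_like_container(const NVHeader& h) {
    return h.hlen == sizeof(NVHeader);
}

// Hypothetical check: the file has to be remapped if another process
// changed its size since this process mapped it.
inline bool needs_remap(const NVHeader& h, std::uint64_t mapped_size) {
    assert(mapped_size >= sizeof(NVHeader));
    return h.size != mapped_size;
}
```

A process would compare the size field against the size it used for its own mapping before each access and remap the file when the two differ.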

In the current version, only the whole database can be locked. These methods are necessary to guarantee mutual exclusion. Whenever the database is accessed, it has to be locked by a shared lock. A shared lock allows other processes to access the database, but does not allow them to modify it. An exclusive lock guarantees exclusive access to the database and allows its contents to be modified. Other methods are provided to open and close an existing database and to set and get the time of the last modification. Table 6.3 gives an overview of the NVcontainer class's methods.

lock(int type, int block) Depending on the type parameter, the database will be locked (exclusive or shared) or unlocked. The block parameter defines whether the system call should block if the request cannot be fulfilled immediately.

get_lock() returns the process currently holding the lock on the database. If several processes hold a shared lock on the database, only one process is reported.

open(char *dbname, int flags) generates a new list whose contents are mapped from the file dbname. Currently, the flags parameter is ignored. In future releases this flag will indicate whether a new database shall be created.

is_open() checks whether a database is associated with this NVContainer.

close() closes the database opened with open().

setmtime(int tm, int force) sets the modification time of the list to tm, either if the force parameter is true or if tm is later than the current modification time.

getmtime(int *tm) stores the modification time of the list in *tm.

Table 6.3: NVcontainer's methods

A detailed explanation of the design decisions and the implementation of the Non Volatile Container Class Library can be found in [Gsc97].
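Whole-file locking with a shared, exclusive and unlocked mode and an optional blocking behavior can be expressed with POSIX fcntl record locks. Whether the NVcontainer class uses fcntl is an assumption; the function lock_file and the enum LockType below are hypothetical and only mirror the parameters of lock(int type, int block).

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical lock kinds corresponding to the `type` parameter.
enum LockType { LockShared, LockExclusive, LockNone };

// Locks or unlocks the whole file behind `fd`. If `block` is true the call
// waits until the lock can be granted, otherwise it fails immediately.
// Returns true on success.
inline bool lock_file(int fd, LockType type, bool block) {
    assert(fd >= 0);
    struct flock fl = {};
    fl.l_type = (type == LockShared)    ? F_RDLCK
              : (type == LockExclusive) ? F_WRLCK
                                        : F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;  // a length of 0 extends the lock to the end of the file
    return fcntl(fd, block ? F_SETLKW : F_SETLK, &fl) == 0;
}
```

With the same interface, the F_GETLK command of fcntl reports the process id of one process holding a conflicting lock, which corresponds to the behavior described for get_lock().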

NVlist Class

The NVlist class provides an abstract class for list manipulations. It provides methods to add and remove data items to a list. Hence, it is used as a base for the NVList and the NVHash classes.

NVList and NVHash Classes

The NVList class is based on the NVlist class and provides all functions necessary for list manipulations. Figure 6.5 shows how the list is organized in non volatile memory. The pointers are stored as file offsets relative to the beginning of the file. The methods provided by the NVList class are shown in table 6.4.

The NVHash class is also based on the NVlist class and provides all functions necessary for the manipulation of a hash table. Figure 6.6 shows the organization of the hash table in the external memory. The hash table is externally linked. The UserData pointer points to an array of pointers. Each of these pointers points to a list that is organized in the same way as the list of the NVList class. The methods provided by this class are shown in table 6.5.
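Storing pointers as file offsets keeps the list valid no matter where the file is mapped. The following sketch illustrates the idea on an in-memory byte image. The node layout, the use of offset 0 as the end marker, and all names (NodeHeader, new_image, prepend, items) are assumptions made for illustration and are not taken from the NVList sources.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical node layout: offset of the next node, payload size, payload.
struct NodeHeader {
    std::uint64_t next;  // offset of the next node, 0 marks the end of the list
    std::uint64_t size;  // payload size in bytes
};

// The first 8 bytes of the image hold the offset of the first node
// (the role played by the userdata entry of the container header).
std::vector<char> new_image() {
    return std::vector<char>(sizeof(std::uint64_t), 0);
}

std::uint64_t load_head(const std::vector<char>& image) {
    std::uint64_t head;
    std::memcpy(&head, image.data(), sizeof head);
    return head;
}

// Writes a new node at the end of the image and makes it the list head.
void prepend(std::vector<char>& image, const std::string& data) {
    const std::uint64_t off = image.size();
    const NodeHeader h{load_head(image), data.size()};
    image.resize(off + sizeof h + data.size());
    std::memcpy(image.data() + off, &h, sizeof h);
    std::memcpy(image.data() + off + sizeof h, data.data(), data.size());
    std::memcpy(image.data(), &off, sizeof off);
}

// Walks the list by following offsets relative to the start of the image.
std::vector<std::string> items(const std::vector<char>& image) {
    std::vector<std::string> out;
    for (std::uint64_t off = load_head(image); off != 0;) {
        NodeHeader h;
        assert(off + sizeof h <= image.size());
        std::memcpy(&h, image.data() + off, sizeof h);
        out.emplace_back(image.data() + off + sizeof h, h.size);
        off = h.next;
    }
    return out;
}
```

Copying the image to a different address leaves the list intact, which is the property a memory mapped file needs, since the mapping address may differ between processes.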


[Figure 6.5: Organization of the NVList class. The header's FreeList pointer leads to a chain of free blocks (first, second, third, ..., last free block); its UserData pointer leads to a chain of data blocks (first, second, ..., last data block). One set of pointers is used to maintain the freelist, the other to maintain the list.]

NVList(char *dbname): creates an NVList and opens the database dbname.
is_empty(): checks whether the list is empty.
prepend(char *data, int szdata): prepends the data pointed to by data with the size of szdata bytes to the list.
append(char *data, int szdata): appends the data pointed to by data with the size of szdata bytes to the list.
remove(): deletes the first data item from the list.
clear(): removes all data stored within the NVList.
Table 6.4: NVList's methods

[Figure 6.6: Organization of the NVHash class. The header's FreeList pointer leads to a chain of free blocks (first, ..., last free block); its UserData pointer leads to the hash table, whose entries point to lists. One set of pointers is used to maintain the freelist, the other to maintain the hash table.]

NVHash(char *dbname, int hashsz): opens a new NVHash container stored in dbname. If the container does not exist already, a new one is generated with a hash-table size of hashsz.
gethashsz(): returns the currently selected hash-table size.
sethashsz(int hashsz): sets the size of the hash-table to hashsz. This method is not implemented currently.
clear(): removes all data stored in this container.
is_empty(): checks whether any data are stored within NVHash.
Table 6.5: NVHash's methods


6.5.2 Active Database

The Active Database holds a short summary of the newsgroups available on the news server. The information about a newsgroup is passed to and returned from this database using the GroupInfo class. The GroupInfo class stores:
• the article numbers of the first and last article,
• the number of articles available in this newsgroup,
• the type of this newsgroup (e.g., whether the group is public, moderated or read-only).

The active database is stored by the ActiveDB class. ActiveDB is an abstract class providing the interface for any active database. Currently only the Non Volatile Active Database (NVActiveDB) class exists. The other classes mentioned in section 5.5.2 will be implemented in future releases. Since we have only implemented the NVActiveDB, it is used for the RServer class as well. As a result the performance of the RServer class may degrade slightly.

[Figure 6.7: Interaction of the ActiveDB class with other classes. Elements shown: News Server, NServer, RServer, LServer, CServer, ActiveDB. Labeled interactions: 1. active, 2. valid?, 3. yes/no, 4. lock, 5. active, 6. active, 7. data, 8. read, 9. unlock, 10. activeDB.]

Figure 6.7 shows the relation of the ActiveDB class to the other classes and the ActiveDB's interaction with the CServer and RServer classes:
1. A client requests the active database from the CServer using the CServer's active method.
2. The CServer class checks whether the contents cached in the ActiveDB class have timed out. This is done using ActiveDB's getmtime method.
3. If the data held in the ActiveDB is valid, the CServer proceeds with step 10.
4. The data stored in the ActiveDB is invalid and the ActiveDB is locked for an update. After the database has been locked, the CServer class checks whether the active database stored within the ActiveDB class has already been updated by another process. For simplicity, this is not shown in the figure above.
5. The CServer class requests the RServer class to retrieve the active database from the news server.
6. The RServer class contacts the news server and requests the active database.
7. The news server sends back the active database.
8. The RServer class notifies the ActiveDB class to read the data sent back by the news server using ActiveDB's read method.
9. The CServer unlocks the active database.
10. Finally the CServer returns a pointer to the ActiveDB.

The methods implemented by the ActiveDB class are shown in table 6.6.
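The ten steps above amount to a check, lock, re-check, update sequence. A minimal sketch, assuming a single process in which a std::mutex stands in for the database lock and a callback stands in for the RServer; ActiveCache, is_valid and get_active are hypothetical names, not the classes of the News Cache.

```cpp
#include <cassert>
#include <ctime>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the ActiveDB: only the parts the steps refer to.
struct ActiveCache {
    std::mutex lock;        // stands in for the exclusive database lock
    std::time_t mtime = 0;  // time of the last update (cf. getmtime)
    int refreshes = 0;      // how often the news server had to be contacted
};

// Steps 2 and 3: the cached data is valid while it is younger than ttl.
bool is_valid(const ActiveCache& db, std::time_t now, std::time_t ttl) {
    return db.mtime != 0 && now - db.mtime < ttl;
}

// Steps 2 to 10. fetch_from_server stands in for steps 5 to 8.
ActiveCache& get_active(ActiveCache& db, std::time_t now, std::time_t ttl,
                        const std::function<void()>& fetch_from_server) {
    assert(ttl > 0);
    if (is_valid(db, now, ttl)) return db;       // steps 2-3: still valid
    std::lock_guard<std::mutex> guard(db.lock);  // step 4 (step 9 on scope exit)
    if (!is_valid(db, now, ttl)) {               // re-check: updated meanwhile?
        fetch_from_server();                     // steps 5-8
        db.mtime = now;
        ++db.refreshes;
    }
    return db;                                   // step 10
}
```

The second validity check is the part omitted from Figure 6.7: without it, two clients that both found the data stale would each fetch the active database from the news server.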

add(GroupInfo &group): adds the group specified by the group parameter and its associated information to the active database.
set(GroupInfo &group): updates the information stored for the newsgroup group.
get(char *name, GroupInfo *group): retrieves the information associated with the newsgroup name and stores it in the group parameter.
read(istream &is): reads an active database from the input stream is and stores it in the database.
write(ostream &os): writes the active database to the output stream os.
Table 6.6: ActiveDB's methods


6.5.3 Overview Database

The Overview Database stores information for each article of a newsgroup: the subject and author of the article, the date the article was posted, the article's identifier and size, and the other newsgroups to which the article was posted. Currently only the non volatile version of the overview database (NVOverviewDB) has been implemented. Hence this class will be used for the RServer class as well. As is the case for the active database, this may degrade the RServer class's performance slightly.

[Figure 6.8: Interaction of the OverviewDB class with other classes. Elements shown: News Server, NServer, RServer, LServer, CServer, GroupInfo, OverviewDB. Labeled interactions: 1. xover, 2. valid?, 3. yes/no, 4. lock, 5. xover, 6. group, 7. data, 8. setdata, 9. xover, 10. data, 11. read, 12. unlock, 13. overviewDB.]

Figure 6.8 shows the interaction of the overview database (OverviewDB) with the other classes.
1. A client requests the overview database using the CServer class's xover method.
2. The CServer class checks whether the data stored for the currently selected newsgroup have timed out. These data are stored in the GroupInfo class, which also stores the number of the first and last article and the total number of articles of the currently selected newsgroup.
3. If the overview database is still valid, the CServer class proceeds with step 13.
4. The CServer locks the overview database (OverviewDB) for an update.
5. The CServer requests the RServer class to transfer the overview database from the news server.
6. The RServer class connects to the news server and selects the appropriate newsgroup.
7. The news server returns some information about the current newsgroup. This information consists of the number of the first article, the number of the last article and the total number of articles.
8. The RServer passes the information received from the news server to the GroupInfo class. The GroupInfo class stores this information and updates its modification time.
9. The RServer class requests the overview records for newly arrived articles from the news server.
10. The news server sends back the requested overview records.
11. The RServer informs the OverviewDB class to read and store the data returned by the news server.
12. The CServer unlocks the overview database.
13. Finally the CServer returns a pointer to the overview database.

The NVOverviewDB is based on the NVList class and allows the inherited methods lock(), get_lock(), open(), is_open(), close(), setmtime(), getmtime(), clear() and is_empty() to be called. In addition to these methods, the NVOverviewDB provides the methods shown in table 6.7.
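Step 9 requests only the overview records of newly arrived articles. One way to derive that range from the cached state and the numbers returned in step 7 is sketched below. The names Range, missing_range and xover_command are hypothetical, and the "XOVER first-last" form follows the common NNTP extension syntax rather than anything stated in this section.

```cpp
#include <cassert>
#include <string>

// Closed range of article numbers; empty when first > last.
struct Range {
    long first;
    long last;
    bool empty() const { return first > last; }
};

// cached_last:  number of the last article whose overview record is cached
// server_first: first article number reported by the news server (step 7)
// server_last:  last article number reported by the news server (step 7)
Range missing_range(long cached_last, long server_first, long server_last) {
    assert(server_first >= 0);
    const long next = cached_last + 1;
    // Articles below server_first have expired on the server already.
    return Range{next > server_first ? next : server_first, server_last};
}

// Builds the request for step 9.
std::string xover_command(const Range& r) {
    assert(!r.empty());
    return "XOVER " + std::to_string(r.first) + "-" + std::to_string(r.last);
}
```

If the cache already holds records up to article 120 and the server reports articles 100 to 150, only the records for 121 to 150 are requested.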

6.5.4 Article Database

Currently each article is stored in its own file. We have implemented a class Article that stores a single news article. This class implements the methods necessary to store an article in and read it from the file system. The Article class is implemented using a Text class that allows strings of unlimited length to be allocated. We did not use the String class provided by the GNU C++ library, since this class limits a string to approximately 35000 characters.

6.6 Miscellaneous Classes

For the implementation of the News Cache the following classes have been implemented in addition to those described in chapter 5. Most of these classes provide the multiplexing functionality explained in section 4.3 or provide better access to other resources used by the News Cache.


select(char *dbname): closes the currently opened database and opens dbname as the class's new database.
read(istream &is): reads an overview database in the format provided by the xover command from the input stream is.
write(ostream &os, int fst, int lst): prints the records for the articles fst-lst provided by the overview database. The records will be written to the output stream os in the format provided by the xover command.
expire(int fst, int lst): expires the overview records for the articles fst-lst.
firstid(): returns the number of the first article whose overview record is stored in the overview database.
lastid(): returns the number of the last article whose overview record is stored in the overview database.
Table 6.7: OverviewDB's methods

6.6.1 Multiplexing List

The Multiplexing List (MPList) is necessary for the News Cache's multiplexing functionality. It decides which news server to contact for which newsgroup. The MPList class is a list of MPListEntries. Each MPListEntry stores which newsgroups should be used from which news server. Table 6.8 shows a sample MPList.

News Server              Newsgroups            Description
news.tuwien.ac.at:nntp   *                     All newsgroups should be provided
news.wu-wien.ac.at:nntp  at.wu-wien.*          The newsgroups of the "Wirtschaftsuniversität Wien"
.NoServer                alt.binaries.*,soc.*  Newsgroups that should be censored
Table 6.8: A sample MPList
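The lookup performed on such a list might look as follows. This is a sketch under assumptions: the text does not say how conflicts between entries are resolved, so the code assumes that later entries override earlier ones, and it supports only exact names and patterns ending in "*". The field and function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory form of one row of Table 6.8.
struct MPListEntry {
    std::string server;                 // e.g. "news.tuwien.ac.at:nntp"
    std::vector<std::string> patterns;  // e.g. {"alt.binaries.*", "soc.*"}
};

// Matches an exact group name or a pattern ending in '*' (prefix match).
bool matches(const std::string& pattern, const std::string& group) {
    if (!pattern.empty() && pattern.back() == '*') {
        const std::size_t n = pattern.size() - 1;
        return group.compare(0, n, pattern, 0, n) == 0;
    }
    return pattern == group;
}

// Assumption: the last matching entry wins, so specific rules placed after
// the catch-all "*" override it. Returns "" if no entry matches.
std::string server_for(const std::vector<MPListEntry>& list,
                       const std::string& group) {
    assert(!group.empty());
    std::string result;
    for (const MPListEntry& entry : list)
        for (const std::string& pattern : entry.patterns)
            if (matches(pattern, group)) result = entry.server;
    return result;
}
```

With the entries of Table 6.8, a group such as comp.lang.c++ is served by news.tuwien.ac.at, while any group below alt.binaries maps to the .NoServer entry.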

6.6.2 sstream

This class provides a socket stream. It allows connecting to a TCP service. Data written to this stream will be sent to the service. Data sent by the service can be read from this stream. It inherits the methods provided by fstream and provides the following additional methods.

connect(char *name, char *service): connects to service at host name. The service parameter can either be a valid service name or a port number preceded by the # sign (e.g., #12000).
is_connected(): returns whether sstream is connected to a host. This method has the same semantics as the is_open method.
disconnect(): disconnects from the service.
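The two forms of the service parameter can be told apart as sketched below. The function name explicit_port and the convention of returning -1 for a service name (which would then be looked up in the system's services database) are assumptions made for this example, not part of sstream.

```cpp
#include <cassert>
#include <string>

// Returns the port for a "#<port>" argument such as "#12000".
// Returns -1 if the argument is not of that form, i.e. if it has to be
// treated as a service name such as "nntp".
int explicit_port(const std::string& service) {
    if (service.size() < 2 || service[0] != '#') return -1;
    long port = 0;
    for (std::size_t i = 1; i < service.size(); ++i) {
        if (service[i] < '0' || service[i] > '9') return -1;
        port = port * 10 + (service[i] - '0');
        if (port > 65535) return -1;  // not a valid TCP port
    }
    assert(port >= 0 && port <= 65535);
    return static_cast<int>(port);
}
```

The # prefix makes the distinction unambiguous, because a service name that happens to consist of digits is never mistaken for a port number.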

Chapter 7 Evaluation

7.1 Access Patterns

The News Cache has been installed at Technische Universität Wien and the whole campus had the possibility to test the News Cache. Unfortunately we had only a very slow machine (the Intel 486 machine shown in section 7.3 in environment 2) for the test of the News Cache. This may be the reason why the News Cache has only been used by few people. Additionally those people had rather different interests. Hence we expect higher hit rates in situations where the users must use the News Cache.

            Total   OverviewDB   Groups   Articles
Requests    14016   947          10950    1392
Misses      9221    352          7436     1183
Hits [%]    34%     62%          32%      15%
Table 7.1: Access statistics

Table 7.1 shows the results of our test phase. With more accesses to the News Cache we expect to get higher hit rates, especially when people from similar departments access the News Cache. With a higher use of the News Cache we expect hit rates of approximately 65%. The users of the News Cache have been accessing 217 different newsgroups, where the top ten newsgroups account for over 50% of the total accesses. This shows that there is some locality in the references to newsgroups even if people with different interests access the News Cache.
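The hit rates in Table 7.1 follow from the request and miss counts as (requests - misses) / requests. A small sketch of this arithmetic; the function name is hypothetical, and the observation that the table's percentages are truncated rather than rounded is derived from the numbers themselves.

```cpp
#include <cassert>

// Hit rate in whole percent, truncated toward zero.
int hit_rate_percent(long requests, long misses) {
    assert(requests > 0 && misses >= 0 && misses <= requests);
    return static_cast<int>((requests - misses) * 100 / requests);
}
```

For the OverviewDB column this gives (947 - 352) * 100 / 947 = 62 (62.8% before truncation), matching the table.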


7.2 Tested News Readers

The following news readers have been tested for their compatibility with the News Cache:

Gnus: works without any problems with the News Cache.
Netscape: issues the GROUP command for every newsgroup to get a better estimate of the number of news articles within the newsgroup. Hence we optimized the News Cache's active database to make this delay unnoticeably small.
Tin: works without any problems with the News Cache.
Xrn: does not work in the current release, because it uses the XHDR command, which has not been implemented yet. We did not implement this command because we did not know that it is still being used. The XHDR command was not part of the last NNTP Internet draft of common NNTP extensions. However, the new draft describes this command again [Bar97].

7.3 Performance Data

The cache server has been tested in two environments. In both environments the news server has been news.tuwien.ac.at. This news server holds approximately 35000 newsgroups.

Environment 1

News server: news.tuwien.ac.at
Machine: Cyrix 6x86 166+, 133MHz, 32MB memory (160MB virtual memory), 512KB second level cache, 10MBit network interface. The machine has been running the News Cache software only.
Client: DEC Alpha 21164, 366MHz, 1MB cache on the same local network.

Environment 2

News server: news.tuwien.ac.at
Machine: Intel 486, 66MHz, 24MB of memory (48MB virtual memory), 64KB second level cache. The machine is heavily loaded since it is also running the squid WWW cache. Squid usually uses 20MB of memory with 10MB usually being in core.
Client: Sun on the same local network.

Table 7.2 shows the performance measurements of two tests. The first column indicates the test. The second and third columns show the time required to satisfy the request by the News Cache and the time needed to satisfy the request from the news server (news.tuwien.ac.at). The cache column gives the values for a cache miss (marked by an "M") and a cache hit (marked by an "H"). The rows marked with "50%M" show the values for the same request, where every other request is a cache miss. Some of the entries in the news server column have been omitted since a cache miss cannot occur for requests to the news server.

Test                                                                       Rows
Retrieving the active database                                             M, H
Selecting a newsgroup and retrieving 500 articles                          M, 50%M, H
Selection of 150 newsgroups and retrieving the overview database of each   M, H

Cyrix (Cache, Server) and I486 (Cache, Server) values: 14s 228s 2s 12s 19s 20s 26s 52s 20s 44s 14s 10s 34s 21s 32s 292s 10s 24s 62s 38s
Table 7.2: Performance measurements

Depending on the load caused by the squid WWW cache, these data may vary.

Chapter 8 Future Work

8.1 Improvements

The following sections describe improvements that will be made to the News Cache to increase its performance. Hence no extra functionality will be visible to the users of the News Cache.

8.1.1 News Database

In future releases, we will provide the full class hierarchy for the active and overview databases. In the current release we provide only classes to store the databases in non volatile memory (usually a file on the hard disk). However, since the RServer class does not need to provide a persistent news database, it makes sense to store its databases in volatile memory. This will improve the performance of the RServer class. To reduce the demands on the file system, the format of the article database will change in a future version. We are currently considering extending the Non Volatile Classes Library to provide efficient management of articles. However, this part of the database has not been designed yet.

8.1.2 Caching Granularity

The caching granularity of the news cache may either be based on articles or on newsgroups. Article based caching reduces the required disk space and network bandwidth, because only articles requested by a client will be cached. On the other hand, newsgroup based caching reduces the number of network connections made to the news server and the load placed on the news server. Additionally, newsgroup based caching reduces the total transmission time, because the articles are requested in fewer, but bigger chunks. This improves the response time for successive requests to such newsgroups.

For small newsgroups the granularity of newsgroups seems to be better, because they require very little disk space and all data are requested in one chunk. However, for large newsgroups, especially newsgroups with large articles, like newsgroups storing pictures or programs, an article based caching mechanism is better. In one of the next releases we want to add this as an option to the user configuration. By default the newsgroups will be cached using an article based strategy. Newsgroups specified by the user will be cached using a whole newsgroup strategy.

8.1.3 Expiration of News Articles

The News Cache is allowed to allocate a fixed amount of disk space for its news database. If the maximum available space has been allocated by the News Cache, older or less frequently requested articles have to be discarded. The same problem exists for other caches as well and many available publications deal with this topic ([BDH+94], [Sta92], etc.). We have not yet analyzed this problem in detail. In the current release we use an external program that removes the least recently used articles first. The advantages and disadvantages of different expiration strategies are:

Least Recently Used: expires the articles that have not been accessed for the longest period first.
Least Frequently Used: removes less frequently used articles first. However, this solution may discard newer, less frequently used articles in favor of older articles that have been used frequently in the past.
Oldest Article First: removes older articles first, because they will expire before the other articles expire.
Biggest Article First: expires the articles with the biggest size first. This eliminates articles usually found in binary groups. The elimination of these articles costs the least time and frees the most disk space.

To solve this problem, statistics for all strategies (including combinations of these strategies) have to be collected. Based on these statistics we will decide on the final expiration behavior.
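The least recently used strategy, which the current release delegates to an external program, can be sketched as follows. This is an illustration only, not the code of that program; CachedArticle, expire_lru and the byte limit are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical bookkeeping record for one cached article.
struct CachedArticle {
    std::string id;           // message identifier
    std::uint64_t size;       // size on disk in bytes
    std::time_t last_access;  // time of the most recent request
};

// Drops least recently used articles until the total size fits into limit.
// Returns the identifiers of the discarded articles, oldest access first.
std::vector<std::string> expire_lru(std::vector<CachedArticle>& articles,
                                    std::uint64_t limit) {
    std::uint64_t total = 0;
    for (const CachedArticle& a : articles) total += a.size;
    // Most recently used first, so victims are taken from the back.
    std::sort(articles.begin(), articles.end(),
              [](const CachedArticle& a, const CachedArticle& b) {
                  return a.last_access > b.last_access;
              });
    std::vector<std::string> removed;
    while (total > limit && !articles.empty()) {
        total -= articles.back().size;
        removed.push_back(articles.back().id);
        articles.pop_back();
    }
    assert(total <= limit);
    return removed;
}
```

Swapping the comparison for one on size or posting date would yield the Biggest Article First or Oldest Article First strategies, which makes it straightforward to collect the statistics mentioned above for each strategy on the same access trace.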

8.2 News Cache Extensions

Besides the basic functions presented in 4.7, the News Cache might offer additional functionality such as local newsgroups stored within the news cache or prefetching of some newsgroups at special times of the day.


8.2.1 Local Newsgroups

The current release of the News Cache allows local newsgroups to be set up as defined in section 3.2. The administrator has to set up a local news server holding the newsgroups that should be provided locally. Then the News Cache has to be configured to contact the local news server for the locally spooled newsgroups. Figure 8.1 shows this configuration.

[Figure 8.1: Merging local and global newsgroups by an intermediate cache server. Labels: global.*; Computer Science Dep. with local.cs.* and a client; Maths Dep. with local.math.* and a client.]

However, this setup costs more performance and more administrative work than necessary. A better solution could be to integrate the management of the local newsgroups into the News Cache. This has the advantage that the administrator does not need to set up and configure a local news server. This extended functionality will be available in one of the News Cache's future releases.

8.2.2 Prefetching

Prefetching means that all new articles of some or all newsgroups available on the news server will be cached prior to a user request. This is useful for newsgroups that are accessed frequently, because it reduces the response time for the first request. Additionally this may balance the network load if those frequently used newsgroups are cached when the network is less loaded. For example, the computer languages department may want to download the comp.lang newsgroups at 6 am. This has the advantage that the news articles are immediately available when somebody logs in and starts reading a comp.lang newsgroup.

8.2.3 Offline News Reading

The News Cache may even provide the possibility to read news offline. To do this, the subscribed newsgroups have to be cached while the News Cache is online. Afterwards, it may supply the cached news articles without the necessity of a network link to its news server. For example, somebody connected to the Internet by modem does not want to be connected permanently (especially when the telephone line is billed by the minute, as in Europe). Hence he might want to download all the newsgroups he wishes to read at once and read those newsgroups afterwards. In the current version, this is possible with limited functionality. Whenever you are connected to your newsfeed, you may run the getnews utility provided with the News Cache system. This retrieves all the groups and articles you are subscribed to. However, this is rather user-unfriendly and will change in future releases.

8.3 Extensions to NNTP

The new streaming mode already allows a better way for news exchange between news servers. This mode is explained in [Bar97]. However, news still lacks a way to compress articles before they are transferred over the network. This is especially necessary and useful for networks connected over limited bandwidth links. To eliminate this problem, we suggest the implementation of a new transfer mode that allows the compressed transmission of articles. However, the commands explained here are only a proposal for discussion and not a bullet proof extension.

The compression extension is activated using the mode compressed command and deactivated using the mode uncompressed command. Other commands should be defined for the negotiation of the compression algorithm. However, since the compression of the articles may imply a higher processor load on the news server, the news server is allowed to ignore this extension for some or all hosts.

We expect that the transmission of news in a compressed format can reduce the required network bandwidth by over 25%. Articles containing readable text only can usually be compressed by over 50%. Binary or compressed articles (e.g., program files) are encoded to a subset of the ASCII character set before they are transmitted by NNTP, because NNTP only guarantees to transfer a subset of the ASCII character set correctly. Hence articles that have been compressed and then uuencoded or base64 encoded can still be reduced by 25%.

As soon as the compression is turned on, the news server may decide before transmitting an article whether the article is transmitted compressed or uncompressed. This decision may be based on the article's size and compressibility. Whenever the news server transfers an article in compressed format, it indicates this using the reply code (see appendix B for an explanation of NNTP). For example, this may be done by using a different numeric reply or by appending a "c" for compressed to the numeric reply code.
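A client could recognize the second variant, a "c" appended to the numeric reply code, as sketched below. Since the extension is only a proposal, the exact wire format is an assumption of this example, as are the names Reply and parse_reply.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Parsed status line of a server reply.
struct Reply {
    int code;         // three digit numeric reply code, e.g. 220
    bool compressed;  // true if the code is directly followed by 'c'
};

// Parses the beginning of a status line such as "220c 123 <id> article".
// Assumption: the marker is a single 'c' immediately after the third digit.
Reply parse_reply(const std::string& line) {
    assert(line.size() >= 3);
    Reply r{0, false};
    for (std::size_t i = 0; i < 3; ++i) {
        assert(std::isdigit(static_cast<unsigned char>(line[i])));
        r.code = r.code * 10 + (line[i] - '0');
    }
    r.compressed = line.size() > 3 && line[3] == 'c';
    return r;
}
```

Compared with a separate numeric reply, the suffix keeps the meaning of the existing codes intact, but it is only safe once the client has switched the extension on, because older clients expect a space after the third digit.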

Chapter 9 Related Work

In late summer 1996 a news cache similar to our News Cache was released as a beta version. This other news cache is called nntpcache and is currently available in version 1.0.3. The nntpcache is available from ftp://ftp.suburbia.net/pub/nntpcache/. It has been written entirely in C and uses memory mapped files similar to the News Cache. For the test of the nntpcache we only had a DEC Alpha computer running Linux available, and the nntpcache crashed soon after it was started. However, this may be related to a type problem (32/64 bit integers). Nntpcache works on Intel Pentium machines running Linux. Unfortunately we could not find any work published along with the nntpcache or dealing with it in any form. Hence we cannot comment on the nntpcache's design decisions.

9.1 Comparison

9.1.1 News Database

The nntpcache stores its active database using a dbm database, while the News Cache uses the NVHash class of the Non Volatile Container Library. The overview database is stored by the nntpcache using its own memory mapped database similar to the News Cache. However, nntpcache has the disadvantage that it uses multiple files for a single overview database. This puts higher demands on the file system. The article database is managed in exactly the same way. Both caches use the format used by the INN news server.

9.1.2 Functionality

In its current release nntpcache has the following features that are currently not supported by News Cache:

• Support for access control based on the client host and the client host's identd. The identd reports the name of the user that initiated the connection.
• Support for filtering and censoring based on newsgroups and articles for specific clients. Filtering eliminates the newsgroups transparently, while censoring informs the client that the newsgroup or article has been censored. Censoring is not fully implemented in the current release.
• Unknown commands can be passed on to the news server. However, if this feature is turned on, the censoring and filtering of articles cannot be guaranteed, since it may be possible to retrieve those articles using an unknown command.

The News Cache provides the following features not supported by nntpcache:

• Prefetching of newsgroups and articles. This means that a specific newsgroup and its articles are retrieved from the news server without a request from a client. The decision which articles have to be transferred is based on configuration files and heuristics.
• The possibility to read Usenet News offline. The News Cache allows newsgroups to be retrieved during times with Internet connectivity. These newsgroups may then be read by news clients even if no connectivity to the news server exists.
• The News Cache can either be run as a standalone daemon or be started using the inetd daemon, while nntpcache can only be started as a standalone daemon.
• News Cache supports filtering of specific newsgroups. We do not provide a content based article censoring mechanism, because this can restrict the right of free speech.

Chapter 10 Summary and Discussion

In this thesis we have presented the Usenet News system and the problems associated with its steady growth. We have analyzed these problems and presented different strategies to solve them. We have shown that the News Cache solution requires the fewest modifications to the existing Usenet News system and provides solutions to most of the problems. Table 10.1 shows the list of problems solved by the News Cache approach.

• The News Cache reduces the network bandwidth necessary for the provision of Usenet News.
• The News Cache reduces the load on news servers caused by news clients.
• The News Cache provides a simple solution for the provision of local newsgroups.
• The News Cache is flexible enough to provide a possibility to read news offline from the news server.
Table 10.1: Usenet News problems solved by the News Cache

We analyzed the requirements of the News Cache and have taken special care that existing software will not break with the introduction of the News Cache. We have designed a database to store the various parts of the news database (active, overview and article database). For optimal performance, we developed a new container class library that stores its data on the file system implicitly using memory mapped files.

We showed the implementation of the News Cache and the interaction of the News Cache's various parts and presented a News Server Class Library that allows the development of other programs that need to access a news database or need to provide access to it. We compared the News Cache with another news cache that was released while this thesis was in progress. At the time of this writing this is the only other existing news cache.

We presented an evaluation of our News Cache in section 7.1 and have shown that the News Cache reduces the load on the news server. Even with a moderate use of the News Cache, it reduces the load placed on the news server by those clients accessing news through the News Cache by 30%. With a wider use of the News Cache we expect better hit rates, because the number of users is usually higher than the number of newsgroups read. We expect that the News Cache can reduce the load placed on the news server by approximately 65%.

Appendix A Notations

A.1 Object Oriented Model

The notation used for the object-oriented models has been taken from [Boo94]. Figure A.1 shows the symbols of the Booch Notation used in this thesis.

[Figure A.1: Notation for object-oriented models. (a) Class Diagram: a class icon with ClassName, Attributes and Operations(). (b) Associations: Association, Inheritance, Has, Using. (c) Properties: A (Abstract Class), F (Friend), S (Static Member), V (Virtual).]

Classes are indicated using bubbles (Figure A.1(a)). A name is required for each class. If the class icon includes attributes or operations, they are separated from the class name by a line. The relations between classes are indicated using association arcs (Figure A.1(b)). The following association types exist:

Association: This type of arc indicates that two classes have a semantic connection to each other. Associations are often labeled using nouns.
Inheritance: This type of arc indicates the inheritance relationship between two classes. Inheritance means that the subclass inherits all the methods and attributes defined by its superclass. The arrowhead of the arc points to the superclass.
Has: This type of arc indicates a "part of" relationship. The end with the circle appears next to the class containing the other class.
Using: This type of arc indicates a client/supplier relationship. The end with the circle appears next to the client class.

The properties icons (Figure A.1(c)) indicate special types of classes or associations. The abstract property may be present within a class's diagram, indicating an abstract class. Abstract classes cannot be instantiated. Abstract classes define only the interface for some of their methods. The implementation of those methods has to be supplied by a subclass. This is useful for the exploitation of polymorphism (see [Boo94] for a detailed discussion). The other properties shown in figure A.1(c) indicate special types of associations.

Friendship: The friendship property may be applied to the supplier of a relation, denoting that the supplier has granted extended access to the client.
Static: An association marked as static points to an instance that is owned by the class and not by its individual members.
Virtual: Indicates that a class's attributes and methods should be inherited once only. This is necessary when a class is inherited twice through the use of multiple inheritance.

A.1.1 Example

The friend property will not be shown in this example. An explanation along with examples can be found in [Boo94].

[Figure A.2: Example for object-oriented notation. Classes: Transport (marked A), Counter (marked S), Car and Ship (each marked V), Amph. Car.]

Figure A.2 shows a class hierarchy for different transport vehicles. The base of all transport classes is the abstract class Transport. Thus, Transport declares only the methods that have to be implemented by its subclasses. This is indicated using the abstract property. To count the number of transport vehicles instantiated, the transport classes use a class variable for the counter. Whenever a transport class gets instantiated the counter is incremented, and whenever a transport class is destroyed the counter is decremented. Hence all transport classes operate on the same Counter class. This is indicated by the static property.

The Car, Ship, and Amphibious Car classes inherit all methods and variables declared and defined by their base class. Since the Car and Ship classes are derived from the Transport class virtually, the Amphibious Car class inherits the methods and variables defined by Transport once only. Without virtual inheritance the Transport class would be allocated twice for the Amphibious Car class and hence the counter would be incremented twice for Amphibious Car instances, which is undesired in this situation.
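The hierarchy of Figure A.2 can be written down directly in C++. The sketch below simplifies the figure in one respect: the Counter class is replaced by a plain static integer. The method medium() and the helper check_single_count() are hypothetical additions that make the example self-contained.

```cpp
#include <cassert>

// Abstract base class; the static member plays the role of the Counter.
class Transport {
public:
    Transport() { ++count_; }
    Transport(const Transport&) { ++count_; }
    virtual ~Transport() { --count_; }
    virtual const char* medium() const = 0;  // pure virtual: Transport is abstract
    static int count() { return count_; }
private:
    static int count_;  // shared by all transport objects
};
int Transport::count_ = 0;

// Virtual derivation: Car and Ship share one Transport subobject when combined.
class Car : public virtual Transport {
public:
    const char* medium() const override { return "road"; }
};

class Ship : public virtual Transport {
public:
    const char* medium() const override { return "water"; }
};

class AmphibiousCar : public Car, public Ship {
public:
    const char* medium() const override { return "road and water"; }
};

// Creating one AmphibiousCar raises the counter by exactly one.
void check_single_count() {
    const int before = Transport::count();
    {
        AmphibiousCar vehicle;
        assert(Transport::count() == before + 1);
    }
    assert(Transport::count() == before);
}
```

Removing the two virtual keywords would give AmphibiousCar two Transport subobjects, so the constructor would run twice and the counter would report two vehicles for a single object.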

A.2 Finite State Machine

The notation used for the finite state machines has been taken from [Boo94]. Figure A.3 shows the symbols used in this thesis.

[Figure A.3: Notation used for finite state machines. State icon: a rounded box with Name and Actions. State transitions: an arc labeled event [guard] / action. Start and stop icons.]

A rounded box indicates a state, displaying the state's name (underlined) and the actions possible in this state. An arc indicates a state transition that is triggered by an event. The arc's label indicates the name of the event and can show the event's guard and the actions taken. The start icon indicates the initial state of the state machine and the stop icon indicates the state machine's final state.

A.2.1 Example

Figure A.4 shows a simple automated teller machine as a state diagram. The automated teller machine is turned off in the initial state. As soon as the operator turns on the machine it goes into the Ready state and waits for
