Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000
The Architecture of the Obelix - An Improved Internet Search Engine

Predrag Knezevic(1), Bozidar Radunovic(1), Nenad Nikolic(1), Tijana Jovanovic(1), Dusan Milanov(1), Miroslav Nikolic(1), Veljko Milutinovic(1), Steve Casselman(2), John Schewel(2)

(1) Department of Computer Engineering, School of Electrical Engineering, POB 35-54, 11120 Belgrade, Serbia, Yugoslavia
(2) Virtual Computer Corporation, 6925 Canby Ave, Suite 103, Reseda, California 91335, USA

E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract

Contemporary Internet search engines can hardly cope with the enormous amount of indexed data, and very seldom succeed in extracting useful information. In order to anticipate the usefulness of a web page, the proposed solution keeps track of users' actions on that page. A modified web client informs a dedicated server of the actions performed on a web page, such as printing, bookmarking, or mailing. The search algorithm later relies on this information when ranking pages according to their usefulness. The paper describes the proposed paradigm in detail, and focuses on the architecture of the server dedicated to collecting information about users' actions. The server's architecture is based on reconfigurable technology and VCC HOT II reconfigurable boards.
1. Introduction

Nowadays, the Internet can be viewed as a huge repository of all sorts of data, spread around the world. Because there is no central administration or updating, it is not possible to maintain a central index of the available information. Therefore, effective searching is one of the biggest problems of the contemporary Internet. Many search engines exist, with different approaches to the problem. Some are maintained with human help, and the available data is classified into categories according to human decisions. Others try to rank documents according to keywords. More advanced techniques, like Clever [1,2] or Google [3,4], include hyperlink citations in the ranking algorithm and boost more-cited sites in the result queries.

Proceedings of the HICSS-2000, Maui, Hawai'i, USA, January 2000.
Nevertheless, it is still left to a machine to interpret the subject of an ordinary document and estimate its importance. Therefore, when using search engines, users often receive result sets containing documents of questionable usefulness.
2. Problem statement

In order to introduce the human factor into ranking algorithms, the Obelix system collects input from users: it gathers information about users' actions while they visit a document (printing, saving, bookmarking) and ranks pages according to these events. Thus, a page that has been visited a thousand times will get a higher rank than a page that has never been visited, even if the second one has more keyword matches than the first. A similar approach has been taken by the DirectHit search engine [5]. It collects information about users' Internet browsing, assigning more-visited pages a higher ranking position. However, while the DirectHit statistics include only page visits, the Obelix engine observes a larger variety of events.

The goal of the Obelix system is to collect the access information and to provide it to the search server in an appropriate form. To perform these tasks, an Obelix server needs to process a large amount of data. In order to conduct real-time data acquisition and processing, a reconfigurable architecture is introduced.
0-7695-0493-0/00 $10.00 (c) 2000 IEEE

3. Proposed solution

The Obelix search engine uses a slightly modified web client-server approach. Namely, the web browser is modified to inform the Obelix server(s) about client actions through a datagram connection. A connectionless protocol was chosen due to the high information workload. The loss of a single packet is of minor importance to the search server, while a connectionless protocol reduces the communication overhead. Each packet created by the web browser contains information about the URL and the performed action (saving, printing, mailing, ...) together with its score. When a user makes a search request, the search engine ranks web pages according to the number of searched words in the page, or some more sophisticated conventional algorithm (the frequency score), and a Casselman score, which represents the sum of the overall scores of user actions. These scores are combined into a general score according to the following formula:
    S = Sf ⋅ Σ_DB Sc
where S is the general score, Sf is the frequency score, and Sc is the Casselman score; the sum runs over all action records stored for the page in the database. The Obelix engine receives and processes the packets. In addition to the search engine itself (e.g., AltaVista or Lycos), the search server site also includes the Obelix server collecting information about users' actions. It consists of a machine that accepts PCI cards (e.g., a PC or an Alpha) with a number of VCC HOT II boards [6] (based on Xilinx 4000 or Spartan) plugged into the PCI slots. The number of HOT II boards depends on the system capacity and traffic load, and can vary over a relatively large range. The information acquisition process consists of three phases: packet collection, sum calculation, and transfer of the results to the database. Each phase is implemented as a separate HOT II configuration.
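As a concrete illustration, the combination of the two scores can be sketched as follows. This is only a sketch: the specific action weights are illustrative assumptions, since the paper does not list the weight assigned to each action type.

```python
# Sketch of the general-score formula S = Sf * sum(Sc): the conventional
# keyword-frequency score Sf is multiplied by the accumulated Casselman
# score, a weighted sum of recorded user actions for the page.
# The weights below are illustrative assumptions, not from the paper.
ACTION_WEIGHTS = {
    "visit": 1,
    "follow_link": 2,
    "bookmark": 5,
    "save": 6,
    "print": 8,
}

def casselman_score(actions):
    """Sum the weights of all user actions recorded for one URL."""
    return sum(ACTION_WEIGHTS.get(a, 0) for a in actions)

def general_score(frequency_score, action_log):
    """Combine the frequency score with the accumulated Casselman score."""
    return frequency_score * casselman_score(action_log)

# A frequently printed and bookmarked page outranks a rarely used one
# with the same keyword match: 0.8 * (1 + 8 + 5)
score = general_score(0.8, ["visit", "print", "bookmark"])
```

Under this model, a page with few keyword matches but many high-weight actions (printing, bookmarking) can overtake a page that is a better textual match but is never used.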
3.1. Information collecting phase

A modified web browser packs the information into an IP packet of the following form:

    Packet ID
    Number of actions in the packet
    Packed URL | Performed action (1 byte)
    ...
    Packed URL | Performed action (1 byte)
The URLs are packed with static Huffman compression. This way, there is no need to ever unpack URLs on the server side, since the same URL from different clients has the same code. The actual codes are predefined on a large statistical sample of URLs. Users' actions are defined to cover most of the users' operations with a web page. The actions included in the present version of the server are:

- Visiting a page
- Saving a whole page
- Saving a part of a page (an image or another object)
- Printing
- Following links from a page
- Cut/copy
- Bookmarking
- A custom dialog for page ranking
- Time spent visiting a page
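The packet layout above, with one action byte per packed URL, can be sketched as follows. Length-prefixing each packed URL is an assumption for illustration; the paper does not specify how URL boundaries are encoded inside the packet.

```python
import struct

# Sketch of the client-side packet: a 16-bit packet ID, a count of
# actions, then (packed URL, action byte) pairs. The 16-bit URL length
# prefix is an assumed framing choice, not from the paper.
def build_packet(packet_id, entries):
    """entries: list of (huffman_packed_url: bytes, action_code: int)."""
    body = struct.pack("!HB", packet_id & 0xFFFF, len(entries))
    for packed_url, action in entries:
        body += struct.pack("!H", len(packed_url)) + packed_url
        body += struct.pack("!B", action)
    return body

def parse_packet(data):
    """Inverse of build_packet; returns (packet_id, entries)."""
    packet_id, count = struct.unpack_from("!HB", data, 0)
    offset, entries = 3, []
    for _ in range(count):
        (url_len,) = struct.unpack_from("!H", data, offset)
        offset += 2
        packed_url = data[offset:offset + url_len]
        offset += url_len
        (action,) = struct.unpack_from("!B", data, offset)
        offset += 1
        entries.append((packed_url, action))
    return packet_id, entries
```

Because the URLs arrive already Huffman-packed, the server can compare them byte-for-byte without ever decompressing them.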
After being received by the server's network card, a packet travels over the PCI bus to the appropriate HOT II board. The board receives the package and stores it in its memory for later processing. There are three important issues to consider when designing the collecting phase: load balancing, detecting fake data, and fault tolerance.

If several boards are present in the system, one has to design a policy for assigning packets to boards. One possible set of policies is based on the sender's IP number, while another is server-controlled distribution to an idle board. The first solution suffers from loss of packets if the destination board is in the processing phase and thus cannot accept incoming packets; the second one minimizes the number of lost packages but generates higher traffic on the PCI bus.

Serving in the real world, the Obelix system is open to different misuses. Web site owners might generate fake traffic to boost their own web site scores. The Obelix server therefore includes detection and prevention of such cases. Each board keeps an event table in its memory, with the following structure:

    Client IP addr. | No. of req. | Last arrival time

Each client is allowed to send a certain number of actions during a defined period of time; further actions are ignored. The Last arrival time field is updated whenever a new action arrives more than the defined period after the stored Last arrival time.

Another important issue is the fault tolerance of the system. Namely, if the appropriate board (or all HOT II boards) is busy calculating sums, a packet arriving over the PCI bus gets lost. Regarding this fact, one can define two policies: one that permits and one that does not permit the loss of packages. For systems with a high workload, the loss of some portion of the user information is acceptable, and sometimes even preferable if it reduces the communication load. Nevertheless, a fault-tolerant algorithm was developed for systems needing it. A fault-tolerant system stores packets in main memory in addition to storing them on the boards, in an entry of the following form:

    Ack | Arrival time | Received IP package
If a board collects a package from the PCI bus, it sends the package ID to the CPU, which sets the corresponding Ack bit to one. The package ID is a 16-bit word randomly chosen by the client. It is not unique in any sense, but random enough to reduce
possible collisions and loss of packages. After some time, the CPU goes through the table, processes all non-acknowledged entries, and deletes the others.
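The acknowledgment table and the CPU sweep just described can be sketched as follows. The timeout value is an assumption; the paper does not state how long the CPU waits before declaring a packet unacknowledged.

```python
# Sketch of the fault-tolerant acknowledgment table kept in main memory.
# Each received packet is stored with its arrival time; when a board
# picks the packet up, it reports the 16-bit packet ID and the entry is
# acknowledged. A periodic sweep re-processes stale unacknowledged
# entries on the CPU and deletes acknowledged ones.
TIMEOUT = 5.0  # assumed seconds before a packet counts as lost by boards

pending = {}  # packet_id -> [acked, arrival_time, raw_packet]

def store_packet(packet_id, arrival_time, raw_packet):
    pending[packet_id] = [False, arrival_time, raw_packet]

def acknowledge(packet_id):
    if packet_id in pending:
        pending[packet_id][0] = True

def sweep(now, process_on_cpu):
    """Process stale unacknowledged packets on the CPU, drop acked ones."""
    for packet_id in list(pending):
        acked, arrival, raw = pending[packet_id]
        if acked:
            del pending[packet_id]
        elif now - arrival > TIMEOUT:
            process_on_cpu(raw)
            del pending[packet_id]
```

Since the 16-bit ID is random rather than unique, a collision can still cause a wrong entry to be acknowledged, which matches the paper's observation that the ID is only "random enough" to make this rare.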
3.2. Calculating weighted sums

When a board receives a certain number of packages, it starts processing them. The purpose of this processing is to compute a weighted sum of all actions related to a certain URL, where each action has its own weight according to its importance (printing is more important than visiting, etc.). Naturally, not all occurrences of a URL will appear on the same board. Therefore, extra processing in the database system is needed, but its time is greatly reduced by calculating these partial sums in the HOT boards. There are several basic units in a board's Xilinx chip; each has the architecture shown in Figure 1, with a stored URL, a comparator matching the input URL against it, and an adder that accumulates the weighted sum of the input actions.

Figure 1 - Structure of a basic processing unit

The basic units are connected into a daisy chain (Figure 2). A unit compares its stored URL with the URL on the input. If they are the same, the unit updates its weighted sum and invalidates the input in the HOT II board's memory. Otherwise, it forwards the package to the next unit. If no URL is stored in the unit, the unit stores the input, updates the sum, and invalidates the data in the memory. In this case, the speedup depends on the number of basic units. After finishing the pass, the board makes a new entry in the memory with the URL and the calculated sum. This record is later forwarded to the main memory and to the database system.

Figure 2 - A daisy chain of basic units

4. Simulation results

All the proposed ideas have been simulated in order to observe the system's behavior and performance for different parameters. There are two main parts of the simulation: the architectural simulation and the search simulation. While the first tests the hardware's ability to process the input users' actions, the second focuses on the efficiency of the search engine itself and its result sets. Furthermore, several different simulations have been made to determine various system parameters, as described later.

4.1. Statistical analysis of URLs

Prior to developing the architectural simulation, the static Huffman compression codes and the optimal length of the URL buffer had to be determined. In order to calculate the codes, a statistical analysis over several megabytes of a Squid cache access log, taken from the news agency's web server, was performed. Static Huffman compression is used because it guarantees the same code for the same text, so there is no need for actual decompression later. The most frequent characters, such as "." and "/", have the shortest codes; Figure 3 depicts the obtained code lengths. By taking statistics over the compressed log file, an optimal URL buffer length of 66 characters was determined, since 99% of the observed URLs are shorter than 66 characters, as can be observed from Figure 4.

Figure 3 - Code length of a compressed character in bits. The URL access log was taken from the news agency's web server.
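The derivation of per-character code lengths from a URL sample can be sketched as follows. The sample URLs in the test are illustrative; the paper used several megabytes of a Squid access log.

```python
import heapq
from collections import Counter

# Sketch of deriving the static Huffman code lengths of Figure 3 from a
# sample of logged URLs. Frequent characters such as "." and "/" end up
# with the shortest codes; because the code table is fixed in advance,
# equal URLs always compress to equal byte strings, so the server never
# needs to decompress them.
def huffman_code_lengths(urls):
    freq = Counter("".join(urls))
    # Heap entries: (subtree weight, tiebreak, {char: depth so far}).
    heap = [(w, i, {c: 0}) for i, (c, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained character one
        # level deeper, i.e. its code grows by one bit.
        merged = {c: depth + 1 for c, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]  # char -> code length in bits
```

Running this over a real log reproduces the shape of Figure 3: separators and lowercase letters get short codes, while rare punctuation gets long ones.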
4.2. Architectural simulation

A simulator of the proposed architecture was developed in C on a Linux system. It models four phases: CPU, PCI, HOT, and DISK, where each task may pass through each phase one or more times. When a user-action packet arrives at the HOT phase, it is stored and removed from the system. When enough user-action tasks arrive at a board, a new task for calculating sums is created.
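The phase-based structure of the simulator can be sketched as follows. The original simulator was written in C; this is a Python sketch of the queueing idea only, with assumed per-phase service times and a single server per phase.

```python
import heapq

# Minimal sketch of the four-phase architectural simulator described
# above. Each task carries the remaining sequence of phases it must
# visit. The service times are illustrative assumptions.
SERVICE_TIME = {"CPU": 0.001, "PCI": 0.002, "HOT": 0.010, "DISK": 0.020}

def simulate(tasks):
    """tasks: list of (arrival_time, [phases]). Returns completion times."""
    events = [(t, i, phases) for i, (t, phases) in enumerate(tasks)]
    heapq.heapify(events)
    done = {}
    busy_until = {p: 0.0 for p in SERVICE_TIME}  # one server per phase
    while events:
        now, i, phases = heapq.heappop(events)
        if not phases:
            done[i] = now
            continue
        phase, rest = phases[0], phases[1:]
        start = max(now, busy_until[phase])   # wait if the phase is busy
        busy_until[phase] = start + SERVICE_TIME[phase]
        heapq.heappush(events, (busy_until[phase], i, rest))
    return done
```

Even this toy version shows the paper's qualitative result: tasks queue up at the slow HOT phase while the CPU and PCI servers stay mostly idle.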
Simulation was performed on two different logs: one from a news agency, and one from a small Internet provider. The original workload was about 10^-1 requests per second for the Internet provider, and 10^-2 requests per second for the news agency. These log files were scaled in time up to 10^5 requests per second in order to simulate a higher workload over a shorter period of time, since a genuine one-week log file with an average of 10^5 requests per second would be impossible to store and process. We assume that the observed maximum average is comparable to the average of the targeted high-end Internet search servers, while the lower averages simulate enterprise or company Intranet or Internet search servers.

Another interesting aspect of the observed log files is their origin. In the case of the news agency, we have a company web server providing information to the outside world and logging access to its own web server. This example simulates a company that wants to improve its own search engine, limited to a relatively small search space (i.e., available products or services, in this case news briefs). Therefore, the average number of occurrences of the same URL within a client package is large, as can be seen from Figure 8. In the case of the Internet provider, the actual search space includes the whole Internet, so the average number of occurrences of the same URL within a client package is small, as can be seen from Figure 9. Observing the performance results from both log files, one gets a complete overview of the response of the Obelix server to different tasks.

There are several parameters of importance that are considered when evaluating the performance of the system. One set of parameters is the usage of system resources: the CPU, the PCI bus, the HOT II boards, and the disk system. Another parameter is the number of stalled tasks. Namely, when a user-action task arrives in the system, it searches for an idle board. If all boards are calculating sums at that moment, the task is stalled, and is either lost or left to the CPU for processing. The last important parameter is the average number of occurrences of a URL within a board's memory. The higher the number of occurrences, the faster the processing, and the less processing is left for the database system.

Figure 4 - Probability of compressed URL lengths. The URL access log was taken from the news agency's web server; this is the corresponding figure with compressed URLs. As can be seen from the figure, most of the URLs are less than 100 characters long.

Figure 5 - CPU usage for a system with two HOT II boards, four basic processing units, and a 25 MHz HOT II clock. The CPU usage is low, less than 10%, and it is almost the same regardless of the workload. During almost the whole processing phase, the CPU is idle, so there are no obstacles to implementing fault tolerance or CPU-controlled load balancing.

Figure 6 - PCI usage for the system with two HOT II boards, four basic processing units, and a 25 MHz HOT II clock. Similar to the CPU, the PCI bus is idle most of the time and can be used for other algorithms, such as fault tolerance and CPU-based load balancing.
As one can see from Figures 5, 6, and 7, the usage of the main system resources is less than 5%. The CPU and PCI usage rise linearly as the workload grows, and they are almost the same regardless of the amount of memory present in the system. On the other hand, disk usage is higher when less memory is present. The reason for this behavior is the nature of the processed data: the gathered client information contains the same URL several times. By calculating weighted sums, a board eliminates repeated occurrences of a URL and thus compresses the package, as can be seen from the package statistics shown in Figures 8 and 9.

Figure 7 - DISK usage for a system with two HOT II boards, four basic processing units, and a 25 MHz HOT II clock. Disk usage is higher for systems with less on-board memory; the bigger the buffer, the more URLs are processed at a time and the more occurrences are observed. Thus the output packets are smaller, and furthermore less processing is done later at the database.

Furthermore, the smaller the packages are, the more processing has to be done afterwards in order to join the statistics from many small packages. Hence, apart from the presented benefits of more memory in the system, one should keep in mind that later processing is greatly reduced by introducing more memory. Another important parameter, as already mentioned, is the average URL occurrence within the board memory at the time of processing. As one can see from Figures 8 and 9, the number of occurrences drops as the on-board memory decreases. Additionally, the number of occurrences is in general lower in the log file of the Internet provider, since that log covers a much wider search space. In the case of the news agency, the log file covers only the agency's web site; therefore the variety of URLs is lower.

Figure 8 - Average number of occurrences of the same URL within a client package (the news agency). The statistics are taken from the log file of a news service providing its own information. The target information space is small, so the number of occurrences is high.

Figure 9 - Average number of occurrences of the same URL within a client package (the Internet provider). The statistics are taken from the log file of a proxy cache of an Internet provider. The target information space is large, so the number of occurrences is low.

Finally, the number of stalled packages is an important parameter showing the system's ability to handle the input workload. The average number of stalled tasks has been measured for different system configurations, i.e., the number of HOT boards, the number of basic units (Figures 1 and 2), the board memory, and the working frequency. The processing part consists of the reconfiguration time needed for the board to switch to the sum-calculating algorithm and back to packet collecting, plus the time actually spent on calculations. The reconfiguration time is approximately 10 ms. The time spent on calculations is O(n^2), where n is the number of URLs in the memory. However, the calculation time depends strongly on the number of occurrences of a URL in the memory, since more URLs are processed at once when the number of occurrences is high. For smaller amounts of memory, the calculation time is much less than 10 ms and reconfigurations are frequent; therefore, the number of stalled tasks tends to rise as memory drops. On the other hand, for configurations with more memory, the calculation time becomes much greater than the reconfiguration time, and hence more influential on the number of stalled tasks than the number of reconfigurations.

The number of stalled packages depending on the on-board memory is presented in Figures 10 and 11, while the influence of the reconfiguration and calculation times on the overall processing time is presented in Figure 12. From the presented figures, it appears that there is an optimal amount of memory yielding the lowest number of stalled tasks. Nevertheless, one should keep in mind that the more memory is installed, the more processing is done on-board and the less is left for the database system, so increasing the amount of memory in the system always brings an overall speedup. Additionally, the more memory is present, the higher the summing/reconfiguration time ratio is, and the higher the efficiency of the system.

Figure 10 - Number of stalled tasks for different buffer sizes (the news agency, two HOT II boards, four basic processing units, 25 MHz HOT II frequency). The input workload is 1.45×10^4 tasks/s. While for very low amounts of memory the reconfiguration time is much greater than the summing time, for higher amounts of memory the summing time grows sharply and the number of stalled tasks grows. For lower workloads the simulation time was not long enough to demonstrate the same behavior.

Figure 11 - Number of stalled tasks for different buffer sizes (the Internet provider, two HOT II boards, four basic processing units, 25 MHz HOT II frequency). The input workload is 1.45×10^3 tasks/s. For more details, see the caption of Figure 10.

Figure 12 - Average time spent on the summing phase, depending on the amount of on-board memory. While for lower amounts of memory most of the time is spent on reconfiguration (approximately 10 ms), the summing process consumes significantly more time for higher amounts of on-board memory.

Figures 13, 14, and 15 depict the number of stalled tasks depending on various system parameters. Several architectural parameters, and their influence on the performance, have been observed in the previous paragraphs. In addition, the simulator has been used to determine the optimal load-balancing algorithm. Three options have been considered: load balancing based on IP number intervals, on the IP number modulus, and server-balanced scheduling.

The first algorithm assumes that the rightmost byte of the client's IP address is uniformly distributed across the Internet. Namely, IP numbers are distributed geographically (a country or a university gets its own class), hence it is likely that more developed areas have more clients and exhibit a non-uniform distribution of the leftmost bytes of the IP addresses. On the other hand, the rightmost byte distribution is up to each subnetwork, and therefore exhibits more uniform behavior. The first algorithm thus divides the range 0-255 into as many intervals as there are HOT boards available, and assigns each interval to a HOT board.

Figure 13 - Number of stalled tasks for different HOT II working frequencies, the news agency (two HOT II boards, four basic processing units, 256 kB URL buffer, workload 10^4 tasks/s)
When a board receives a packet, it checks whether the packet fits its interval; if it does, it stores the packet in its memory, otherwise it rejects it. This way there is no need for communication among the boards, or between the boards and the CPU. The obvious drawback of the approach is that if a board is calculating sums at the time of a packet's arrival, the packet is lost (or processed by the CPU). Another observed drawback is that network administrators tend to assign lower rightmost IP bytes first, thereby raising the number of clients within the lower intervals and increasing the load of the related packets.

The second algorithm solves this problem by calculating the remainder of the rightmost IP byte divided by the number of HOT II boards. Nonetheless, the problem of lost packages observed in the previous algorithm remains.

The third algorithm is based on CPU-controlled packet scheduling, overcoming the major drawback of the previously mentioned algorithms. Namely, the CPU checks for the first available board and transfers the received packet to it. In this case a packet is lost (or processed by the CPU) only if all boards in the system are calculating sums at the time of the packet's arrival. Nevertheless, the algorithm introduces communication overhead, since the CPU has to receive every packet and monitor the status of each board. Another drawback is a decrease in the performance of the fake-packet detection algorithm, since in this case not all the packets of the same client necessarily arrive at the same board. Figures 16 and 17 present simulation results comparing the three proposed algorithms.

Figure 14 - Number of stalled tasks for different numbers of basic units, the news agency (two HOT II boards, 256 kB URL buffer, 25 MHz working frequency, workload 10^4 tasks/s)

Figure 15 - Number of stalled tasks for different numbers of boards present, the news agency (four basic units, 256 kB URL buffer, 25 MHz working frequency, workload 10^4 tasks/s)

Figure 16 - Number of stalled tasks for different load-balancing algorithms (the news agency's log)

Figure 17 - Number of stalled tasks for different load-balancing algorithms (the Internet provider's log). Note that the sample workload was not accurate enough for the case of a 4-board system.

4.3. Search simulation

In order to get information about the quality of the Obelix search service, the MS Windows version of the Netscape Navigator web browser was modified so that it collects and sends information about users' actions to the Obelix server. The server side is at present implemented as a software-only application running on a Linux server that collects requests and stores them in a database. The search server is implemented as a web application running on the same Linux server as the Obelix simulator. It is based on the Infoseek engine. The Infoseek engine was chosen among many other Internet search engines because it outputs the accuracy of each found item; this accuracy is later used as the frequency score. The search server takes input queries from the web and forwards them to the Infoseek engine. It then collects the result page and extracts the URLs and percentage scores out
t Ar
C om pu te rs
Bo ok s
M
us
ic
16.00% 14.00% 12.00% 10.00% 8.00% 6.00% 4.00% 2.00% 0.00%
Figure 18 - Comparison chart on percentage of successful searches on the Infoseek and the Obelix engine. The presented searches were conducted by 100 student voulonteers. 70.00%
Infoseek Obelix
60.00% 50.00% 40.00% 30.00% 20.00% 10.00%
t Ar
C om pu te rs
Bo ok s
us
ic
0.00% M
DB
of it. Afterwards, it calculates the Obelix results according to the formula:

S = Sf · ΣSc

where S is the general score, Sf is the frequency score, and Sc is the Casselman score (note that there is no demographic score at present, due to the large similarity among the students in the target user group). The results are re-ranked according to the Casselman score and returned to the web user.

The first part of the testing was a data-collection phase. Students at the University were provided with a modified version of the Mozilla Navigator and used it for their own web searches. During that time, the system collected information about their browsing destinations and the corresponding actions. After collecting data for a month, a test search was conducted. Ten students were asked to perform searches on both the Obelix and the Infoseek engine, since the two differ solely in the Obelix algorithm. The students made test queries on five different topics: sport, music, books, computers, and art. They then graded the result sets according to the usefulness of the searches. The results of the test pointed to several important issues that have to be considered when implementing an Obelix system:
- Much bigger user-action database: A database of 1MB of user actions was collected during a month of test run. Still, the number of URLs in this database is incomparable with Infoseek's actual database. Therefore, during the test search, the test users very seldom obtained as a query result a URL that had already been filed in the database. Most of the result sets were the same as Infoseek's, due to the large diversity of the data; hence we were unable to present meaningful numerical data about the usefulness of the test.
- High-speed Internet link: While performing a query, the Obelix search engine has to fetch the query result from the Infoseek engine and process it. In order to get a comprehensive result set, the Obelix has to retrieve the full result set, which sometimes contains several thousand URLs for a single user request.
- Flexible string matching: Sometimes several sites, or even several directories of a single site, contain information of the same quality. Therefore, the fact that a query result contains a URL similar to one in the database should raise that URL's score. A more obvious example is site and directory aliases: http://www.yahoo.com/Computers/Computer_books and http://dir.yahoo.com/Computers/Computer_books are exactly the same document, yet their URLs do not match.
- Keyword-related database: Many words that are only loosely related to the contents of a document may appear in its text. If a document is popular, frequently visited, and bookmarked, it will rank high even if its frequency score Sf is low. Therefore, each record of the Obelix database should contain information about the keywords related to the particular URL. Nevertheless, this information is impossible to obtain when a user simply types in a URL. Since most users start browsing from a search server, the keywords typed into the search box can later be associated with the user actions, as long as the user continuously follows links.
- Detailed documents first: Documents containing general information tend to receive more hits than specific documents, so they tend to rank higher. This problem can also be solved by introducing keywords: an exact match of a specific keyword should influence the overall score more than a partial match.

Numerical search results are presented in Figures 18 and 19. Figure 18 shows the effectiveness of the two search engines, the Infoseek with and without the Obelix. Figure 19 shows how much more successful the Obelix searches were compared to the Infoseek ones, in percent.

Figure 19 - Improvement of the Obelix searches over the Infoseek ones, in percent, for five different subject categories.
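As an illustration, the re-ranking step based on the formula S = Sf · ΣSc can be sketched as follows. The record layout, field names, and sample values are our assumptions for illustration, not the actual Obelix data structures.

```python
# Sketch of the Obelix re-ranking step: S = Sf * sum(Sc).
# Field names and sample values are hypothetical illustrations.

def obelix_score(frequency_score, casselman_scores):
    """General score: the frequency score times the sum of the
    Casselman (user-action) scores collected for the URL."""
    return frequency_score * sum(casselman_scores)

def rerank(results):
    """Re-rank an Infoseek result list by the Obelix general score.

    `results` is a list of dicts with keys 'url', 'sf' (frequency
    score), and 'sc' (a list of per-action Casselman scores)."""
    return sorted(results,
                  key=lambda r: obelix_score(r["sf"], r["sc"]),
                  reverse=True)

results = [
    {"url": "http://a.example/", "sf": 0.9, "sc": [0.1]},       # high Sf, few user actions
    {"url": "http://b.example/", "sf": 0.4, "sc": [0.5, 0.4]},  # bookmarked and printed
]
ranked = rerank(results)
```

In this sketch a page with strong user-action evidence outranks a page that merely matched keywords well, which is the behavior the test search was measuring.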
0-7695-0493-0/00 $10.00 (c) 2000 IEEE
5. Conclusion
A new approach to a document ranking system has been presented in this paper. A computer architecture based on the reconfigurable paradigm has been used for the system implementation. Finally, the evaluation of the engine has been presented, in comparison with an existing one. An architectural simulator of the system was implemented in order to test the price/performance ratio of the system. Several important conclusions can be drawn from the simulation results:
• A moderate hardware platform suffices for the data-collecting tasks. For up to 10^5 tasks per second, the usage of system resources (CPU, PCI, disk) is less than 5%. Although the actual workload of a running Obelix server is not known at present, it is expected that a single server will be able to handle requests from a relatively large community (100,000 requests per second corresponds to roughly 1,000,000 on-line users).
• The longest phase in a processing cycle is the sum calculation. For 512KB of buffer memory (which is the case with the HOT II boards), this phase may last up to 0.5 seconds. The interval is even longer due to the reconfiguration time, so additional speedup might be achieved by avoiding reconfiguration. However, in the case of larger HOT buffers, the reconfiguration time becomes insignificant.
• All collected and processed information is later stored on the system's hard disks. Additional processing is necessary in order to sum up the data about the same URL coming from different boards at different times. This processing can be done as a batch process, or during the database search. However, it is important to eliminate as much of this additional processing as possible. This can also be done by introducing more buffer memory on the HOT boards, which on the other hand raises the number of stalled packets. The optimal amount of on-board memory therefore depends very much on the underlying database system.
• The best load-balancing algorithm is CPU-controlled balancing. This implies that the CPU has to collect all requests from the network and transfer each of them to the first idle board. As already stated, the overhead is insignificant due to the low usage of the CPU and the PCI bus. Furthermore, it enables fault tolerance, since all packets are already stored in the CPU's memory.
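A minimal sketch of the CPU-controlled balancing described above: the CPU buffers incoming packets and hands each one to the first idle board. The board interface and all names are assumptions for illustration, not the HOT II driver API.

```python
from collections import deque

# Sketch of CPU-controlled load balancing: the CPU collects all
# requests and forwards each to the first idle board. Because every
# packet passes through CPU memory first, it can be retransmitted if
# a board fails (the fault-tolerance benefit noted above).

class Board:
    """Stand-in for one HOT II board; `capacity` models buffer memory."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.buffered = 0

    def idle(self):
        return self.buffered < self.capacity

    def accept(self, packet):
        self.buffered += 1

def dispatch(packets, boards):
    """Forward each packet to the first idle board; packets that find
    no idle board stay queued in CPU memory (stalled)."""
    stalled = deque()
    for pkt in packets:
        board = next((b for b in boards if b.idle()), None)
        if board is None:
            stalled.append(pkt)   # keep in CPU memory for later
        else:
            board.accept(pkt)
    return stalled

boards = [Board("hot0", capacity=2), Board("hot1", capacity=2)]
stalled = dispatch(range(5), boards)   # 5 packets, total capacity 4
```

The stalled queue in this sketch corresponds to the packets held in the CPU's memory when all on-board buffers are full, which is why the amount of on-board memory influences the number of stalled packets.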
• The main problem of the Obelix system is the demographic search. In the case of a demographic search, each received packet carries demographic data about a user. These data are in general unique, due to the large variety of the collected information, which raises the question of the usefulness of this kind of on-line processing. Although some benefit of on-line processing might be achieved with a very large amount of on-board memory, the usefulness is still unclear due to the absence of appropriate input statistics. Namely, the user group that tested the Obelix system consists mainly of university students with similar demographic data and interests; the results might therefore vary greatly for different target user groups.
The performance simulation of the system has shown that the presented techniques improve the search results. The improvements vary, and go up to 10% for some subject categories. During the simulation, several other performance obstacles were spotted, and will be addressed in future work. Among them are: flexible string matching, a keyword-related database, and the distinction between index documents and detailed documents. Nevertheless, the Obelix has proved itself a valuable search engine, offering more useful search results than conventional search engines.
6. References
[1] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
[2] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan, "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text," Proceedings of the 7th World-Wide Web Conference, 1998.
[3] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the 7th World-Wide Web Conference, 1998.
[4] J. Cho, H. Garcia-Molina, and L. Page, "Efficient Crawling Through URL Ordering," Proceedings of the 7th World-Wide Web Conference, 1998.
[5] DirectHit search engine's whitepaper, http://www.directhit.com/whitepaper.html
[6] Virtual Computer Corporation HOT II product guide, http://www.vcc.com/Hotii.html
Appendix A How We Changed Mozilla Nenad Nikolic, Tijana Jovanovic, Dusan Milanov Department of Computer Engineering School of Electrical Engineering POB 35-54, 11120 Belgrade, Serbia, Yugoslavia E-mail:
[email protected],
[email protected],
[email protected]
Background
Back in 1998, Netscape decided to put its main product's source code on the Web, free to download, as open source. Netscape Communicator has thus become open to everyone's contributions and improvements. Mozilla, the code name for the whole project, has dramatically increased its functionality ever since, and has opened various possibilities for researchers and developers.

Problem
Currently, search engines sort their results regardless of the quality and usefulness of the indexed web pages. Sometimes these results are heavily affected by holes in the search algorithms and by payments made by content providers. There is no feedback from the users of these pages - how satisfied they were with what the pages had to offer.

Idea
The idea is to collect information on users' satisfaction with the pages they visit, and to rank web pages according to these criteria. Since Netscape made its Communicator's source code publicly available, it became possible to realize this idea.

Obelix's Architecture
• Client - a modified Mozilla, which sends information on user actions over a page to a remote server.
• Server - a hardware-implemented PCI board; it collects information from the clients, processes it, and updates the databases, enabling better indexing of pages.

Modifying Mozilla
The Mozilla source is downloadable from the supporting web site, www.mozilla.org. It is now around 14.9MB, and supports the Win32, Mac, and Unix platforms. It is written in Visual C++ with MFC 4.2. We made changes in nearly twenty source and header files, and added a class for communication with the server, using Windows Sockets.
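The client-server exchange in this appendix can be illustrated with a minimal sketch. The message layout (one tab-separated action record per line) and the action names are our assumptions, not the actual wire format used by the modified Mozilla.

```python
# Sketch of an action report a modified browser could send to the
# Obelix server. The tab-separated layout is a hypothetical format,
# chosen only to illustrate the kind of data being collected.

ACTIONS = {"visit", "print", "bookmark", "mail"}

def encode_report(url, action):
    """Encode one user action as a single line for the server."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %s" % action)
    return "%s\t%s\n" % (action, url)

def decode_report(line):
    """Server-side inverse of encode_report."""
    action, url = line.rstrip("\n").split("\t", 1)
    return url, action

line = encode_report("http://www.yahoo.com/", "bookmark")
url, action = decode_report(line)
```

In the real system the client class added to Mozilla would write such records to a Windows socket, and the server (or the simulator described below) would decode them on arrival.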
How To Build Mozilla
The complete instructions on how to build the entire package can be found at http://www.mozilla.org/build/win32.html. It takes approximately one hour to build Mozilla from scratch, and ten minutes for subsequent builds.
Server Simulator
Before the PCI boards were made, we created a server simulator. Its initial purpose was to accept raw information without additional processing. It was later improved to enable simulations of how the PCI boards should be load-balanced.
Do-It-Yourself
• Download the patch from http://galeb.etf.bg.ac.yu/rti/ebi/mozpatch.zip
• Download the source
• Apply the patch (instructions included in How-To.txt)
• Build the lizard!
• Download the simulator from http://galeb.etf.bg.ac.yu/rti/ebi/obsim.zip
• Set the port settings in Mozilla
Appendix B Possible Usage of H.O.T. II Boards for Implementing the Obelix Miroslav Nikolic Department of Computer Engineering School of Electrical Engineering POB 35-54, 11120 Belgrade, Serbia, Yugoslavia E-mail:
[email protected]

The solution that we propose for the Obelix server is a hardware accelerator, based on a reconfigurable FPGA board that can be put in a PCI slot of the server. Our choice is the H.O.T. II PCI board, made by Virtual Computer Corporation: a PCI plug-in card with the XC6216 RPU for computation and the XC4013 FPGA with a PCI Target Mode interface. The board also features 2MB of fast 15ns SRAM, a programmable on-board clock, and PMC Mezzanine connectors. The XC6216 RPU has approximately 7K logic gates. Well-written drivers and the Hardware Object Programming API ensure good integration of the board into the system. The H.O.T. board behaves like a closely coupled coprocessor.

An XC6216 circuit consists of a large array of simple, configurable cells. Each cell contains a computation unit capable of implementing one of a set of logic functions, and a routing area for inter-cell communication. The structure is simple, symmetrical, hierarchical, and regular, allowing novice users to quickly make efficient use of the available resources. The first generation of fine-grain architectures implemented only nearest-neighbor interconnection and had no hierarchical routing. The XC6200 family is the second generation, which has introduced a hierarchical cellular array structure. Adjacent cells are grouped into blocks of 4x4 cells, which form a cellular array communicating with the neighboring 4x4 cell blocks. A 4x4 array of these 4x4 blocks forms a 16x16 block. In the XC6216 part, a 4x4 array of these 16x16 blocks forms the central 64x64 cell array, which is surrounded by I/O pads.

The board that is to be put into the Obelix has three basic configurations (phases):
Phase 1. Collecting client statistics
The clients collect information about user actions and send it via the TCP/IP protocol to the Obelix server. The H.O.T. boards snoop the PCI bus and collect this information in the board memory. When the memory is full, the board notifies the Obelix software and reconfigures itself for phase 2.

Phase 2. Calculating sums
This is a computationally heavy phase, used for calculating the weighted sums, and it is the most important reason for using this board for hardware acceleration. Multiple functional units are used during the calculations: every unit compares the URLs and calculates the weighted sum. A significant speedup can be obtained in this way. The second reason for implementing this algorithm in hardware is to decrease the CPU occupancy. When the calculation is done, the board notifies the Obelix software and reconfigures itself for the next phase - phase 3.

Phase 3. DMA transfer
After the calculation of the weighted sums, the data are transferred into the main memory, or directly to the hard disk, using a burst DMA transfer on the PCI bus, and the board reconfigures itself again for phase 1 - Collecting client statistics.

One of the advantages of this approach is scalability: it is possible to have more than one board in the system at the same time, enabling the Obelix server to handle a higher workload. One day, it should also be easy to implement all this using some other board with a faster RPU, more user memory, and more logic gates, obtaining further speedup.
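The three phases form a simple cycle, which can be sketched as a small state machine. The buffer size, class names, and score handling below are illustrative assumptions, not HOT II specifics.

```python
from collections import Counter

# Sketch of the board's three-phase cycle described above:
# COLLECT (fill buffer) -> SUM (weighted sums) -> DMA (flush) -> COLLECT.
# Buffer size and score handling are illustrative, not HOT II specifics.

class BoardCycle:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = []          # phase-1 packet store
        self.phase = "COLLECT"

    def collect(self, packet):
        assert self.phase == "COLLECT"
        self.buffer.append(packet)
        if len(self.buffer) == self.buffer_size:
            self.phase = "SUM"    # buffer full: reconfigure for sums

    def calculate_sums(self):
        assert self.phase == "SUM"
        # Phase 2: sum the scores of identical URLs (weights omitted here).
        sums = Counter()
        for url, score in self.buffer:
            sums[url] += score
        self.phase = "DMA"
        return dict(sums)

    def dma_flush(self, sums, main_memory):
        assert self.phase == "DMA"
        # Phase 3: burst-transfer the results, then restart collecting.
        main_memory.append(sums)
        self.buffer.clear()
        self.phase = "COLLECT"

memory = []
board = BoardCycle(buffer_size=3)
for pkt in [("http://a/", 1), ("http://b/", 2), ("http://a/", 4)]:
    board.collect(pkt)
sums = board.calculate_sums()
board.dma_flush(sums, memory)
```

Each flush in this sketch corresponds to one batch of per-URL sums arriving in main memory; merging batches about the same URL from different boards is the additional processing discussed in the conclusion.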