Trigger Grouping: A Scalable Approach to Large Scale Information Monitoring

Calton Pu, Ling Liu, Wei Tang
College of Computing, Georgia Institute of Technology
[email protected], [email protected], [email protected]

Abstract

Information change monitoring services are becoming increasingly useful as more and more information is published on the Web. A major research challenge is how to make the service scalable to serve millions of monitoring requests. Such services usually use soft triggers to model users' monitoring requests. We have developed an effective trigger grouping scheme to optimize trigger processing. The main idea behind this scheme is to reduce repeated computation by grouping monitoring requests of similar structures together. In this paper, we evaluate our approach using both measurements on real systems and simulations. The study shows significant performance gains using the trigger grouping approach. Moreover, the gains are critically dependent on group size and group size distribution (e.g., Zipf). We also discuss the benefit, trade-off, and runtime characteristics of the proposed approach.

1 Introduction

The World Wide Web has become a huge collection of data sources that change continuously. As the Web grows and evolves, information change monitoring services are becoming increasingly useful. Instead of having users remember when to visit Web pages of interest and manually identify what and how the pages have changed, a change monitoring service delivers change information while it is still fresh. This kind of service is especially suited for busy individuals who keep track of transient data such as stock quotes, product prices, news headlines, and weather conditions. The monitoring requests are often modeled as "soft triggers".

Designers of large-scale information monitoring and change notification systems face a common problem: most software systems are based on a pure request-response model that does not allow servers to asynchronously notify clients of server-side events. From a systems perspective, many distributed systems need the functionality of asynchronous event dissemination; examples include callbacks in distributed file systems [1] and gossip messages in lazy replication systems [15]. It is desirable to design a general-purpose event dissemination infrastructure on which large-scale change monitoring systems can be implemented.

Continual Queries (CQ) [11, 12] were proposed as a general technique to monitor updates from a variety of sources on behalf of users. When a CQ detects an update of interest, the new information is pushed to the user through a notification service. A frequently asked question in Web information systems is the scalability of a service when the number of users and the number of monitoring requests grow into the millions. A common observation is that as the number of users increases, common interests among the users tend to grow as well. In this paper, we describe a trigger grouping approach to process triggers efficiently and achieve higher system scalability. Our study is performed in the context of Continual Queries; however, the research framework and proposed algorithms should apply to general information change monitoring services.

The contribution of this paper is three-fold:

1. Introducing a general framework for building large-scale information monitoring systems.

2. Proposing alternative trigger grouping schemes based on three grouping types: S-type, C-type, and OC-type. Detailed data structures are presented in the paper.

3. A comprehensive performance study of the trigger grouping approach. Our study demonstrates the grouping benefit as a function of group size and group size distribution in the presence of various cost factors. The paper studies two types of workload for group sizes: uniform distribution and skewed (Zipf-like) distribution.

The remainder of the paper is organized as follows: Section 2 gives an architectural overview of Continual Query systems; Section 3 introduces the main algorithms and data structures for trigger grouping in the CQ context; Section 4 presents an experimental study of our proposed approach; Section 5 compares related work; Section 6 concludes the paper.

2 Continual Query Overview

We define a continual query as a quadruple (Q, Tcq, Notif, Stop), consisting of a normal query Q (e.g., an SQL-like query, an XQuery, or a keyword-based query), a trigger Tcq, a notification component Notif, and a termination condition Stop. In contrast to ad hoc queries in conventional DBMSs or Web search engines, a continual query, once activated (i.e., installed and started), runs continually over the set of information sources. The trigger part of a continual query specifies the events and situations to be monitored. It could be time-based, e.g., at 10am every Tuesday; or content-based, e.g., IBM's stock price changed by one dollar.

Example 1. Consider the continual query "Notify me ([email protected]) whenever the daily high price of Microsoft stock drops by 5%". It can be expressed as follows:

    Create CQ Microsoft_stock_watch as
      Query:    EXTRACT companyname, daylow, dayhigh, volume
                FROM Stock
                CONDITION symbol = 'MSFT';
      Trigger:  AT SOURCE [email protected]
                WHEN stock.high DECBYP 5
                WHERE stock.symbol = 'MSFT';
      Notif:    email([email protected]);
      Start:    9:00:00 am, Feb. 19, 2003;
      Stop:     9:00:00 am, Dec. 31, 2003;
      Calendar: Monday-Friday, 8:00:00 am-5:00:00 pm;

When a continual query is installed, a persistent CQ object is created. A change detection query and a trigger evaluation expression are derived from the trigger component. The CQ is classified into one of the processing groups according to identified patterns in its trigger expression (explained in detail in Section 3.2.2). The runtime daemon is in charge of monitoring information source changes and checking user-defined thresholds to react to those changes. Once the trigger condition becomes true, the CQ query is activated to fetch the fresh information from the data sources. If the query result differs from that of the last CQ execution, the notification module is invoked to check whether a notification needs to be sent.

Trigger evaluation is the first and most important step, because query and notification happen only when the trigger becomes true. Thus, a significant portion of CQ cost lies in the periodic testing of the trigger condition, with the remote access to fetch new data as the primary cost factor. As the most important step toward building a scalable CQ system for information monitoring, we propose an optimization mechanism for continual query processing, namely CQ trigger grouping based on trigger patterns.
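The evaluation cycle just described, where the trigger test always runs first, the query runs only when the trigger fires, and notification is sent only when the result changed, can be sketched as follows. This is a minimal illustration with hypothetical interfaces, not the actual OpenCQ API:

```java
import java.util.function.Supplier;

// Minimal sketch of one CQ evaluation cycle (hypothetical interfaces):
// the cheap trigger test always runs; the query runs only when the
// trigger fires; a notification is sent only when the fresh result
// differs from the last snapshot (the differential-query behavior).
public class CqRuntime {
    public static String lastResult = null;   // last query snapshot
    public static int notificationsSent = 0;  // stand-in for the email module

    // Returns true if a notification was sent in this cycle.
    public static boolean evaluate(Supplier<Boolean> trigger,
                                   Supplier<String> query) {
        if (!trigger.get()) return false;            // trigger condition false
        String fresh = query.get();                  // fetch fresh data only on fire
        if (fresh.equals(lastResult)) return false;  // unchanged: suppress notification
        lastResult = fresh;
        notificationsSent++;
        return true;
    }

    public static void main(String[] args) {
        final double price = 94.0;  // simulated source value
        // "Notify when MSFT's daily price drops below 95."
        boolean sent = evaluate(() -> price < 95.0, () -> "MSFT@" + price);
        System.out.println("notification sent: " + sent);
    }
}
```

A second cycle over an unchanged source value would fire the trigger but suppress the notification, which is why trigger evaluation dominates CQ cost.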

3 Continual Query Trigger Grouping

With thousands of people installing thousands or even millions of CQs in a large-scale information monitoring system, chances are that many monitoring requests share common components. For example, they could be monitoring the same information or have overlapping interests (a partial match of trigger expressions); e.g., one user is interested in IBM's and Microsoft's stock price changes while another is monitoring both IBM's and Nokia's stock prices.

The main idea behind continual query grouping based on trigger patterns is that many trigger expressions share the same or similar structures. Therefore, we adopt an idea similar to inverted files and poll information sources for new information only once for all the CQs that monitor that source. This way, the monitoring cost for the CQ server and data sources remains constant, no matter how many users want to monitor new information from that source.

The trigger grouping mechanism consists of three main steps. First, all CQs upon their entry to the CQ system are transformed into a canonical format; this canonical form is fed into the trigger pattern discovery step. Second, one of three alternative algorithms of varying granularity is used, and each CQ trigger is classified into a particular group according to its extracted trigger pattern. Third, the CQ runtime engine makes use of the group information for trigger condition testing and evaluation.

3.1 Canonical Transformation

Canonical transformation of trigger expressions identifies rules that map triggers to trigger patterns. This step is necessary for the grouping based on trigger patterns in a later stage of CQ processing. As shown in Section 2, trigger conditions of continual queries have three common clauses: AT SOURCE, WHEN, and WHERE. The canonical transformation rewrites the WHEN clause and the WHERE clause into a canonical representation for later grouping. The representation consists of two parts: Object Event (WHEN clause) and Event Context (WHERE clause). The process is as follows:

1. Translate the WHEN clause to conjunctive normal form (CNF), i.e., the and-of-ors notation.

2. Translate the WHERE clause to conjunctive normal form, as in the previous step. Move all conjuncts containing non-equality predicates to the WHEN clause by connecting the two parts with "∧" (logical AND).

3. Group the conjunctive terms by the set of object classes and data sources they refer to in the WHEN clause.

4. If the entire WHEN expression has p constants, they are numbered 1 to p from left to right. Consider constant number k (1 ≤ k ≤ p): if it appears in Cij in the original expression and Cij is of the form "Aij op vij", then the number-k constant vij in Cij is substituted with the generic term CONSTANTk.

5. If the entire expression is a selection predicate group and it has q operators, they are numbered 1 to q from left to right. For operator number h (1 ≤ h ≤ q) of the form "Aij op vij", the operator op is substituted with the generic term OPERATORh.
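Steps 4 and 5 can be illustrated with a small sketch. This is a hypothetical helper operating on predicate strings with numeric constants and the operator set shown in the examples; the actual transformation operates on parsed CNF expressions rather than raw strings:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not the OpenCQ implementation): derives trigger
// patterns from a WHEN-clause predicate by numbering constants and
// operators left to right and replacing them with the generic terms
// CONSTANTk and OPERATORh, as in steps 4 and 5 above.
public class TriggerCanonicalizer {

    // Toy token grammar: comparison operators plus numeric constants.
    // The INCBY/DECBY family is assumed from the paper's examples.
    private static final Pattern TOKEN =
        Pattern.compile("(>=|<=|=|>|<|INCBYP|DECBYP|INCBY|DECBY)|(\\d+(?:\\.\\d+)?)");

    // C-type pattern: constants generalized, operators kept (step 4).
    public static String cPattern(String when) {
        StringBuffer out = new StringBuffer();
        Matcher m = TOKEN.matcher(when);
        int k = 0;
        while (m.find()) {
            if (m.group(2) != null) m.appendReplacement(out, "CONSTANT" + (++k));
            else m.appendReplacement(out, m.group(1));  // keep the operator
        }
        m.appendTail(out);
        return out.toString();
    }

    // OC-type pattern: operators and constants both generalized (steps 4+5).
    public static String ocPattern(String when) {
        StringBuffer out = new StringBuffer();
        Matcher m = TOKEN.matcher(when);
        int k = 0, h = 0;
        while (m.find()) {
            if (m.group(2) != null) m.appendReplacement(out, "CONSTANT" + (++k));
            else m.appendReplacement(out, "OPERATOR" + (++h));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(cPattern("Stock.price >= 98"));
        System.out.println(ocPattern("Stock.price = 100 AND Stock.volume > 340000"));
    }
}
```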

The output of canonical transformation is a list of rewritten CQs in compact form. The system maintains several hash tables in memory for fast identification of trigger groups in the grouping step. Example output of canonical transformation is shown in Figure 1; common structures are easily identified and grouped together at later stages. For example, CQ1 and CQ3 share most of their canonical forms except for the constant part.

[Figure 1 shows six example triggers on source [email protected], all with event context Stock.Symbol='IBM' (CQ1: Stock.price = 100; CQ2: Stock.price INCBY 2; CQ3: Stock.price = 102; CQ4: Stock.price >= 98; CQ5: Stock.price >= 98; CQ6: Stock.price = 100 AND Stock.volume > 340000), their canonical forms, and the resulting groups. S-type grouping yields 5 groups (S-TP1: Stock.price = 100; S-TP2: Stock.price INCBY 2; S-TP3: Stock.price = 102; S-TP4: Stock.price >= 98; S-TP5: Stock.price = 100 AND Stock.volume > 340000). C-type grouping yields 3 groups (C-TP1: Stock.price = CONSTANT; C-TP2: Stock.price INCBY CONSTANT; C-TP3: Stock.price >= CONSTANT). OC-type grouping yields 1 group (OC-TP1: Stock.price OPERATOR CONSTANT).]

Figure 1: Examples for three types of trigger grouping

3.2 Alternative Grouping Algorithms

3.2.1 Types of Subscription Groups

The main idea behind CQ trigger grouping according to trigger patterns is based on the premise that a large number of triggers often share some of the predicate variables but may take different constant values or different operations in their trigger condition testing. When many triggers are on the same data source, the cost of fetching remote data can be reduced if we group the accesses to common objects together. Currently, we only consider grouping trigger expressions with the same event context. According to the granularity of trigger pattern grouping, we develop grouping strategies based on three types of trigger patterns: S-type (Simple-type), C-type (Constant-type), and OC-type (OperatorConstant-type). Figure 1 gives an example of how the triggers of different CQs are grouped according to the three types of grouping algorithms. The triggers of CQ1-CQ6 are on the same source ([email protected]). After the canonical transformation, a trigger pattern is identified as "Stock.price OPERATOR CONSTANT", with the operators and constants numbered accordingly.

S-type trigger grouping is the simplest strategy: we group continual queries having the exact same trigger pattern (same data source, trigger object, trigger context, and operator). This happens when users have the same monitoring specification over the data objects. This type of grouping has the finest granularity among all grouping algorithms, since it requires an exact match of trigger expressions for CQs in the same group; therefore, it results in the smallest group sizes. An example of an S-type pattern instance is "Stock.price = 100".

C-type trigger grouping relaxes the constant matching in the trigger pattern. For example, when two users are monitoring IBM stock information, one with a trigger on "Stock.price = 100" and the other on "Stock.price = 102", they are grouped together since they can share the same change detection query (on IBM's price). The trigger pattern instance is "Stock.price = $CONSTANT".

OC-type trigger grouping relaxes both the operator part and the constant part of a trigger pattern. That is, triggers having the same change detection query but different operators and values are grouped together. In this case, all 6 CQs are grouped together in one group.¹ The trigger pattern instance is "Stock.price $OPERATOR $CONSTANT".

3.2.2 Algorithm Overview

The life cycle of a continual query involves several steps: subscription installation, event detection, trigger condition testing, query, and notification, where query and notification can be optional. CQ subscription installation includes making the continual query a persistent object, canonically transforming it, identifying the trigger patterns, and classifying the CQ into a particular group. Event detection is an ever-running process that detects changes happening at the data sources. Trigger condition testing can also be called trigger expression evaluation. The query is executed once trigger condition testing returns true. Finally, notification is activated when the query finishes (for punctual queries) or when the query result differs from the last snapshot (for differential queries). The CQ subscription clustering (a.k.a. CQ grouping) algorithm is described in Table 1. The trigger evaluation of a continual query is divided into the following steps:

¹ Note that for CQ6 the trigger has two predicates: one is "Stock.price = 100", the other is "Stock.volume > 340000". Based on selectivity, we choose the first predicate for grouping, so the second predicate becomes the non-indexable part.

    INPUT:
      // A CQ subscription contains Query, Trigger, Notif, and Stop
      s   = [Q, T, N, Stop]
      C   = {existing subscription clusters}
      Lc  = {global constant list}
      Lop = {global operator list}
    BEGIN:
      t = canonical_transform(T)
      G = get_cluster_type()          // S, C, or OC type
      case G of
        'S':  // nothing
        'C':  update_list(Lc, get_constant(t))
        'OC': update_list(Lc, get_constant(t))
              update_list(Lop, get_operator(t))
      // get the trigger pattern under the current clustering algorithm
      p = extract_pattern(t, G)
      if ((ck = match_cluster(p, get_pattern(C))) == FALSE) then {
        // new trigger pattern found
        c' = make_cluster()
        insert(c', s)
        C = C ∪ c'
      } else
        insert(ck, s)
    END.

Table 1: CQ subscription clustering algorithm
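The clustering loop of Table 1 can be condensed into a short sketch. The types below are hypothetical, and pattern extraction is reduced to string masking over an already-canonicalized predicate:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Table 1 clustering algorithm (hypothetical types, not the
// OpenCQ code): each subscription is filed under the trigger pattern for
// the chosen grouping type; a new cluster is created only when the
// extracted pattern is new.
public class SubscriptionClusterer {
    public enum GroupType { S, C, OC }

    // clusters: trigger pattern -> list of subscriber CQ ids
    public final Map<String, List<Integer>> clusters = new HashMap<>();

    // extract_pattern(t, G): S keeps the whole canonical predicate;
    // C masks the constant; OC masks both operator and constant.
    static String extractPattern(String obj, String op, String cst, GroupType g) {
        switch (g) {
            case S:  return obj + " " + op + " " + cst;
            case C:  return obj + " " + op + " CONSTANT";
            default: return obj + " OPERATOR CONSTANT";
        }
    }

    public void install(int cqId, String obj, String op, String cst, GroupType g) {
        String p = extractPattern(obj, op, cst, g);
        // match_cluster + make_cluster collapsed into computeIfAbsent
        clusters.computeIfAbsent(p, k -> new ArrayList<>()).add(cqId);
    }

    public static void main(String[] args) {
        // The five single-predicate triggers of Figure 1 (CQ6's compound
        // predicate omitted): S -> 4 groups, C -> 3 groups, OC -> 1 group.
        for (GroupType g : GroupType.values()) {
            SubscriptionClusterer sc = new SubscriptionClusterer();
            sc.install(1, "Stock.price", "=", "100", g);
            sc.install(2, "Stock.price", "INCBY", "2", g);
            sc.install(3, "Stock.price", "=", "102", g);
            sc.install(4, "Stock.price", ">=", "98", g);
            sc.install(5, "Stock.price", ">=", "98", g);
            System.out.println(g + ": " + sc.clusters.size() + " groups");
        }
    }
}
```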

1. Selecting the grouping type (S, C, or OC): Currently, only one trigger grouping strategy can be applied at a time. Based on system load and deployment considerations, we can switch between different grouping strategies.

2. Canonical transformation of CQ subscriptions.

3. Trigger pattern recognition: All trigger patterns are recognized, with a unique pattern ID (signature) assigned to each pattern. CQs with triggers of the same pattern are clustered into the same group. The system maintains a trigger pattern list; if a new pattern is identified, a new entry is inserted into the list, with a pointer to the new subscription. Trigger patterns are classified into S-patterns, C-patterns, and OC-patterns according to the chosen grouping strategy.

4. Event detection query generation: Each trigger in a CQ subscription is decomposed into two parts: event detection and event testing. For example, if a trigger reads "AT SOURCE [email protected] WHEN stock.high DECBYP 5 WHERE stock.symbol='MSFT'", the event detection query is constructed to fetch the remote page on stockmaster.com for the value of Microsoft's stock price (high). The testing then simply compares this current value against '5' with operator DECBYP. Detection queries accessing the same page are grouped together in this step for reduced network delay.

5. Trigger expression testing: Once the individual event testing is done, the CQ trigger expression consisting of composite events is tested, and the result is either true or false. The query component is activated only when the trigger expression evaluates to true.

3.2.3 Implementation Consideration

The CQ trigger grouping algorithms use a set of indices: a data source hash table, a trigger predicate index for each source, trigger pattern lists, a constant list, and an operator list. The data source hash supports fast matching between source names and source IDs. For each source, a predicate index is maintained, indicating the predicates on that data source. Every predicate in turn is mapped to specific trigger patterns (TPs). Based on the grouping strategy chosen, each trigger pattern containing one or more generic constant terms may need a separate constant list that links to entries in the trigger expression (TE) list (C-type trigger grouping). For OC-type trigger grouping, an additional operator list is created for every entry in the trigger expression list, with each entry in the operator list pointing to a separate constant list. Figure 2 illustrates the related data structures.

[Figure 2 shows the index structures: a data source hash table whose entries (S1, S2, ..., Sk) point to per-source predicate indexes (e.g., Stock.price), which in turn point to a trigger pattern list (TP1 ... TPm) tagged S, C, or OC. For S-type grouping, each trigger pattern entry holds the trigger expression and the list of subscribing CQ ids. For C-type grouping, each pattern additionally points to a constant list (V1 ... Vs), with each constant holding its own subscriber list. For OC-type grouping, each pattern points to an operator list (Op1 ... Ops), and each operator entry points to a separate constant list.]

Figure 2: Illustration of CQ trigger grouping algorithms

In short, whenever a trigger pattern is derived from a newly installed CQ, the system checks whether its signature can be found in the table TriggerPatternSig. If the trigger pattern is new, it is added to the table. If the signature has at least one generic constant term, a constant table is created for this trigger pattern signature, with an entry added to record the indexable part, the list of constant-operator pairs, and the non-indexable part of the newly installed CQ. Due to space limitations, the detailed implementation is omitted here.
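As a rough illustration of these structures (an assumed layout following the description above, not the omitted OpenCQ implementation), the OC-type path from source to subscriber CQ ids could look like:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed sketch of the OC-type index: source -> trigger pattern ->
// operator -> constant -> subscriber CQ ids, so one freshly fetched
// value can be tested against every member of a group in one pass.
public class OcIndex {
    private final Map<String, Map<String, Map<String, Map<Double, List<Integer>>>>> idx
            = new HashMap<>();

    public void add(String source, String pattern, String op, double cst, int cqId) {
        idx.computeIfAbsent(source, s -> new HashMap<>())
           .computeIfAbsent(pattern, p -> new HashMap<>())
           .computeIfAbsent(op, o -> new HashMap<>())
           .computeIfAbsent(cst, c -> new ArrayList<>())
           .add(cqId);
    }

    // Test one fetched value against all grouped predicates of a pattern;
    // only plain arithmetic comparisons are sketched here.
    public List<Integer> fire(String source, String pattern, double value) {
        List<Integer> fired = new ArrayList<>();
        Map<String, Map<Double, List<Integer>>> ops =
                idx.getOrDefault(source, Collections.emptyMap())
                   .getOrDefault(pattern, Collections.emptyMap());
        ops.forEach((op, consts) -> consts.forEach((cst, cqs) -> {
            boolean hit;
            if (op.equals("="))       hit = value == cst;
            else if (op.equals(">"))  hit = value > cst;
            else if (op.equals(">=")) hit = value >= cst;
            else if (op.equals("<"))  hit = value < cst;
            else if (op.equals("<=")) hit = value <= cst;
            else                      hit = false;  // INCBY etc. need cached state
            if (hit) fired.addAll(cqs);
        }));
        return fired;
    }
}
```

With the Figure 1 triggers installed, one fetched price of 100 fires the "=100" and ">=98" groups in a single traversal, which is the point of the shared index.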

4 Experiments and Results

4.1 Experiment Setup

The experiments were done on a combination of the prototype OpenCQ engine with a simulated CQ system for scalability experiments. The simulator uses the same code as the prototype OpenCQ implementation, but it is able to "run" a much larger number of CQs than native OpenCQ. As a result, we prepared a synthetic workload for most of the experiments. The time for installing CQs and loading them into memory is not counted in the total running time. Every experiment was run 10 times and the average runtime was computed, to mitigate the effect of initialization cost (e.g., system cache fill-up and Just-In-Time compilation of Java methods) and Java garbage collection.

The system hardware consists of a Sun Enterprise 450 server with 4 400-MHz UltraSPARC-II processors, 1GB of memory, and a 100Mbps Ethernet local area network. The system run-time software consists of the Oracle 9i Enterprise DBMS running on Solaris 7. The OpenCQ engine and the simulator are both implemented in Java (JDK 1.3).

4.2 Workload

The test data were collected at the beginning of February 2000. Stock symbols were extracted from Yahoo! Web pages and quotes were extracted from DBC.com. Two simulated data sources were set up (Stock, with quotes over 8337 symbols, and StockNews, with news information about the companies represented in the set of stock symbols) to satisfy the requirements of the scalability experiments in the simulator. The simulated data sources reside in local area networks to avoid fluctuations in access time when fetching remote data. To simplify the evaluation, instead of sending out emails to users, the notification components simply write the notification results to a log file at the server.

We studied four atomic trigger types in our experiments:

• Type 1 trigger (arithmetic comparison operator) →
Example: tell me IBM's stock quote whenever its last trading price is over $110.
CQ Trigger: AT SOURCE [email protected] WHEN stock.last > 110 WHERE stock.symbol = 'IBM'

• Type 2 trigger (cache comparison operator) →
Example: tell me IBM's stock quote whenever its last trading price changes by 5%.
CQ Trigger: AT SOURCE [email protected] WHEN stock.last INCBYP 5 WHERE stock.symbol = 'IBM'

• Type 3 trigger (multiple conditions) →
Example: tell me IBM's stock quote whenever its last trading price is over $108 and the volume is more than 200000.
CQ Trigger: AT SOURCE [email protected] WHEN stock.last > 108 ∧ stock.volume > 200000 WHERE stock.symbol = 'IBM'

• Type 4 trigger (join operator) →
Example: tell me Nokia's stock quote whenever some telecommunication companies appear in the news headlines and their last trading prices are over $45.
CQ Trigger: AT SOURCE [email protected] WHEN stock.last > 45 WHERE stockNews.title LIKE '%Telecommunication%' ∧ stockNews.stocksymbol = stock.symbol

Although the four types of triggers differ in evaluation times, they have similar running characteristics. Therefore, when presenting the experimental results, unless otherwise stated, the analysis applies to all trigger types. The simulation workload consists of triggers with their types evenly distributed.

We study two types of workloads in terms of the number of CQs within each group after grouping: fixed and skewed. For the fixed workload, incoming CQs are uniformly distributed among groups of equal size after the CQ grouping algorithm. For the skewed workload, the group sizes are distributed according to Zipf-like distributions after the grouping algorithm. In our experiments, the probability density function used for the general Zipf-like distribution is P(i) = C/i^α, where C is a normalizing coefficient.

The intuition behind the Zipf-like distribution is that a large number of users have common interests (sharing a small number of CQs on the same subjects), a large number of CQs each have a small number of interested users, and a moderate (less than linear) number of users subscribe to the CQs in between. Previous research has shown that the distribution of Web page requests follows a Zipf-like distribution [8, 9, 10], with varying α.

The experimental parameters are summarized below:

    TF    percentage of triggers fired   100%
    QF    percentage of queries fired    0-100%
    NF    percentage of notif. fired     0-100%
    NG    number of groups               1-Ncq
    TCg   grouping cost                  50-200ms/group
    HQ    query cache hit ratio          20%-40%
    α     Zipf popularity parameter      0.4-1.6
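The skewed workload's group sizes can be generated with a small sampler following the P(i) = C/i^α form above; the parameters in the demo are illustrative, not the paper's exact configuration:

```java
import java.util.Random;

// Sketch of the Zipf-like group-size model used in the skewed workload:
// group i (1 <= i <= NG) receives an incoming CQ with probability
// P(i) = C / i^alpha, where C normalizes the probabilities to sum to 1.
public class ZipfGroups {

    // Normalizing coefficient C = 1 / sum_{i=1..NG} (1 / i^alpha).
    public static double normalizer(int nGroups, double alpha) {
        double sum = 0;
        for (int i = 1; i <= nGroups; i++) sum += 1.0 / Math.pow(i, alpha);
        return 1.0 / sum;
    }

    // Sample the group index for one incoming CQ by inverse CDF.
    public static int sampleGroup(int nGroups, double alpha, Random rnd) {
        double c = normalizer(nGroups, alpha);
        double u = rnd.nextDouble(), cdf = 0;
        for (int i = 1; i <= nGroups; i++) {
            cdf += c / Math.pow(i, alpha);
            if (u <= cdf) return i;
        }
        return nGroups;  // guard against floating-point round-off
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] sizes = new int[11];  // groups 1..10
        for (int n = 0; n < 14000; n++) sizes[sampleGroup(10, 1.2, rnd)]++;
        // With a high alpha, the first (most popular) group dominates.
        System.out.println("group 1: " + sizes[1] + ", group 10: " + sizes[10]);
    }
}
```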

We measured the system performance in terms of elapsed time and system throughput (CQs/sec). The ultimate goal is to increase the overall system throughput for evaluating all installed continual queries without causing too much delay in response time for individual CQs.

4.4 Performance Results

A “typical” CQ execution time breakdown is shown in the following table (measured for Type 2 trigger).

                          Trigger   Query    Notif.   Bookkeep
    Mean elapsed time     42.8ms    61.1ms   7.3ms    11.4ms
    Standard deviation    18.8      17.4     1.8      10.3
We organize the rest of this section as follows: Section 4.4.1 studies the performance characteristics of different types of triggers and the benefit of grouping with a fixed group size; Section 4.4.2 studies the impact of data source size; Section 4.4.3 discusses the effect of the fixed-group-size workload on trigger grouping performance; Section 4.4.4 discusses the effect of the skewed-group-size workload; finally, Section 4.4.5 analyzes the various sources of grouping cost.

4.4.1 Varying Triggers and Fixed Group Size

The benefit of CQ trigger grouping is shown in Figure 3.

[Figure 3 plots Trigger Evaluation Time (seconds) against the number of CQs (200 to 14000) for the non-grouping case and group sizes of 10 and 50, in four panels: Type 1 trigger (arithmetic comparison), Type 2 trigger (cache comparison op), Type 3 trigger (multiple condition), and Type 4 trigger (join operator).]

Figure 3: Grouping benefit for different triggers

In each of the graphs, we demonstrate the benefit of CQ trigger grouping with varying group sizes compared with the non-grouping case. The figure shows clearly that trigger grouping can cut the evaluation time by orders of magnitude, depending on the group size: the larger the group size, the shorter the trigger evaluation time. This is because we only need to evaluate the common part of multiple triggers once for a whole group of CQs. The total evaluation time is linear in the number of CQs in the system, and the execution time saved is proportional to the increase in group size. For example, for trigger type 1, when the group size increases from 10 to 50, the evaluation time drops from 94.7 seconds to 18.4 seconds. However, a group size of 10 does not mean the grouped evaluation time reaches one tenth of the non-grouped cost. This is due to the presence of grouping cost, which is mainly the time needed to access the various grouping indices; grouping cost becomes minimal once all the index tables are preloaded into main memory. We study grouping cost in detail in Section 4.4.5.

Figure 3 also demonstrates that different triggers have different evaluation costs (higher-cost triggers have steeper performance curves). Triggers with a join operator are generally more expensive than the other types of triggers, because they involve fetching data from multiple data sources.

4.4.2 Impact of Data Size

In this experiment, we study how different data sizes affect the evaluation of continual queries. Two types of triggers are studied: the type 1 trigger discussed in the previous section, and a trigger with an aggregation function, such as the following.

Example: tell me IBM's stock quote whenever the average stock trading volume exceeds 340000.
Trigger: AT SOURCE [email protected] WHEN AVG(stock.volume) > 340000

[Figure 4 plots Trigger Evaluation Time (sec) against data source size in Kbytes (200 to 3600) for the non-grouping case and grouping with Gs=10, in two panels: Type 1 trigger (Ncq=14000) and Trigger with aggregation (Ncq=14000).]

Figure 4: Data Size Impact on Trigger Grouping

Figure 4 compares the non-grouping and grouping cases for the two types of triggers as the data size changes, in terms of elapsed time as well as throughput. Increasing the data size raises the evaluation times for both types of triggers, but the impact on the type 1 trigger is far less significant than on the trigger with aggregation. The data size change does not significantly affect performance under grouping: system performance is more stable using trigger grouping on data sources of different sizes.

4.4.3 Impact of Fixed Group Size

[Figure 5 plots system throughput (triggers/sec) for the four trigger types under the non-grouping case and fixed group sizes Gs = 5, 10, 50, 100, and 200.]

Figure 5: System throughput with fixed group sizes

Figure 5 demonstrates how different settings of the fixed group size affect the throughput of CQ trigger processing. Given a fixed workload on CQ trigger grouping, when the group size is 200 CQs/group the CQ system can evaluate up to 2812, 2202, 1235,

and 541 triggers/second for the four types of triggers. While for fixed group size of 5CQs/group, the numbers are only 74, 53, 30, and 13, respectively. The trigger evaluation time is inversely proportional to the size of the groups. 4.4.4

Impact of Skewed Group Size

We have presented above the performance characteristics for CQ trigger grouping under the fixed workload. In this section, we study the trigger grouping under the Zipf-like workload. Grouping speedup v.s. total # of CQs and alpha parameter

40

Figure 7: Speedup ratio under uniform workload

35

Speedup ratio

30 25 20

mation about CQ group subscriptions. There are three kinds of grouping costs: installation cost, runtime cost, and maintenance cost.

15 10 5 0

• Installation Cost: At installation time, before a new continual query is made persistent, we have to identify the group to which this particular CQ belongs. A new group will be created if necessary. A proper group ID will be assigned to this CQ and associated group data Figure 6: Trigger grouping speedup for skewed workstructures will be updated. load 1.6

1.4

2.6

1.2

1

2

0.8

1.4

0.6

0.4

Alpha value

4

x 10

0.8

0.2

Total # of CQs

For a Zipf-like workload, a high α value indicates that popular objects are very popular and unpopular objects are very unpopular, which means that the bigger groups are much larger. Accordingly, a low α value indicates a more homogeneous user subscription stream (user interests are widely spread), with many users interested in many different data objects. The number of groups tends to be higher and group sizes smaller when the α value is lower.

Figure 6 shows the speedup for trigger grouping under skewed workload. The speedup is higher when α is higher. This is because a higher α value indicates a more heterogeneous subscription group configuration (larger groups are much larger than smaller ones). Thus there are fewer groups in the system after grouping, which in turn saves more evaluation time.

4.4.5 Costs of grouping

The presence of grouping cost is illustrated in Figure 7. Although analytical study shows the grouping speedup to be close to the group size, real measurement of the system achieves only a fraction of the analytical value (e.g., when the group size is 180 CQs/group, the speedup ratio is only close to 30). The difference indicates the presence of extra cost in the grouping scheme. Grouping cost arises because extra data structures are needed to store and retrieve the grouping information:

• Runtime Cost: At runtime, when CQs are being evaluated, the system has to first retrieve grouping information from group index tables on the fly. The grouping cost is namely the lookup time for each group index table. In a distributed environment, we cannot assume the information about each group will reside in the main memory of one host machine. Extra cost will therefore include networking delays when we retrieve group information from remote hosts. For example, Figure 8 illustrates the trigger grouping speedup ratios in the presence of runtime grouping cost, which can greatly impact CQ system performance. In the performance study of this paper, the grouping cost refers to this runtime cost.

• Maintenance Cost: We may need to adjust continual query grouping at certain times when system performance degrades. This happens when group sizes change among CQ groups. We observe that when group sizes are too small, CQ grouping does not help system performance at all. Taking trigger grouping as an example, in Figure 8 grouping is no longer beneficial for situations falling below the dashed baseline. For the fixed-group-size workload, when the group size falls below 5 CQs/group, grouping yields no benefit under a grouping cost of 200 ms/group. When the group size is smaller than 2 CQs/group, trigger grouping performs worse than the non-grouping case under all three configurations of grouping cost.
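The break-even threshold above can be reproduced with a simple back-of-the-envelope cost model. This is a sketch, not the paper's exact model: the 50 ms per-CQ evaluation time is an assumed value, chosen only because it makes a 200 ms/group lookup cost yield the 5 CQs/group break-even point quoted above.

```python
def speedup(group_size, eval_ms=50.0, lookup_ms=200.0):
    """Speedup of grouped vs. ungrouped evaluation under a simple cost model.

    Ungrouped: each of the group's CQs is evaluated separately
               -> group_size * eval_ms.
    Grouped:   one shared evaluation plus one group-index lookup
               -> eval_ms + lookup_ms.
    """
    return (group_size * eval_ms) / (eval_ms + lookup_ms)

def break_even_size(eval_ms=50.0, lookup_ms=200.0):
    """Smallest group size for which grouping is at least as fast
    as the non-grouping case (speedup >= 1)."""
    return 1.0 + lookup_ms / eval_ms

print(break_even_size())  # 5.0 under the assumed costs
for size in (2, 5, 30):
    print(size, speedup(size))
```

Under these assumed costs, a group of 2 CQs runs slower than the non-grouping case (speedup 0.4), a group of 5 just breaks even, and larger groups increasingly amortize the lookup cost, matching the qualitative shape of Figure 8.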

Similarly, for the skewed workload in which group sizes follow a Zipf distribution, the cost of grouping has to be considered when the grouping performance curves fall below the baseline. We may need to dynamically switch the current grouping algorithm to a coarser one (e.g., from S-type to OC-type trigger grouping).
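The effect of α on group sizes can be illustrated with a small simulation. This is a hypothetical sketch: the object count, CQ count, and the rank-based weight 1/i^α are illustrative assumptions, not the paper's actual workload generator.

```python
import random
from collections import Counter

def zipf_weights(n_objects, alpha):
    """Unnormalized Zipf-like popularity weights: rank i gets 1 / i**alpha."""
    return [1.0 / (i ** alpha) for i in range(1, n_objects + 1)]

def simulate_group_sizes(n_cqs, n_objects, alpha, seed=42):
    """Assign each CQ to a monitored object drawn from a Zipf-like
    popularity distribution; CQs on the same object form one trigger
    group. Returns group sizes in descending order."""
    rng = random.Random(seed)
    weights = zipf_weights(n_objects, alpha)
    choices = rng.choices(range(n_objects), weights=weights, k=n_cqs)
    return sorted(Counter(choices).values(), reverse=True)

for alpha in (0.2, 0.8, 1.4):
    sizes = simulate_group_sizes(n_cqs=14000, n_objects=2000, alpha=alpha)
    print(f"alpha={alpha}: groups={len(sizes)}, largest group={sizes[0]}")
```

Running this shows the trend discussed above: a low α spreads the 14,000 CQs nearly uniformly over many small groups, while a high α concentrates them into far fewer groups dominated by a few very large ones, which is exactly the regime in which grouping pays off most.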

[Two plot panels: speedup ratio vs. group size (CQs/group) under the fixed workload (Ncq=14000), and speedup ratio vs. alpha value under the skewed (Zipf) workload (Ncq=14000); curves: grouping w/o cost, grouping w/ cost, non-grouping]
Figure 8: Grouping benefit threshold

5 Related Work

Some recent research on continual/continuous queries includes [7, 3], but that work focuses on grouping queries rather than triggers, and its continual query specifications are simpler than ours. A set of matching network structures [2, 4, 13] have been proposed in AI production systems research; these data structures are used for efficiently matching and evaluating the predicates in rule conditions. In CQ systems, due to complicated data change monitoring requirements and a specific architectural design, we cannot apply these techniques directly; system-specific structures are designed to work with other system components in evaluating continual queries. Our work is also related to general publish/subscribe systems and filtering services [6, 14], in which efficient matching and filtering algorithms are crucial.

6 Conclusion

In this paper, we presented an effective grouping approach to processing information monitoring requests. The main idea behind this approach is to reduce repeated computation by grouping monitoring requests of similar structures (trigger patterns). We discussed alternative grouping algorithms based on three grouping types: S-type, C-type, and OC-type. We studied the approach in the context of continual queries, although the general framework applies to general information monitoring services. We also presented a detailed performance study of the benefits and trade-offs of the proposed approach. While trigger grouping is a first step towards building a scalable information monitoring system (such as a CQ system), more optimization techniques need to be studied to achieve higher parallelism in the system. Our ongoing research includes developing a suite of integrated mechanisms for large-scale information monitoring, such as query caching, notification grouping, and effective change detection and differencing algorithms using page digests [5].

References

[1] M. Satyanarayanan. Scalable, Secure, and Highly Available Distributed File Access. IEEE Computer, 23(5):9-21, May 1990.

[2] C. L. Forgy. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence, 19(1):17-37, September 1982.

[3] S. R. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously Adaptive Continuous Queries over Streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.

[4] D. P. Miranker. TREAT: A Better Match Algorithm for AI Production Systems. In Proceedings of the Sixth National Conference on Artificial Intelligence (AAAI-87), pp. 42-47, August 1987.

[5] D. Rocco, D. Buttler, and L. Liu. Sdiff. Technical Report, Georgia Tech, College of Computing, February 2002.

[6] F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha. Filtering Algorithms and Implementation for Very Fast Publish/Subscribe Systems. In Proceedings of the ACM SIGMOD Conference, 2001.

[7] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.

[8] J. Zhang, R. Izmailov, D. Reininger, and M. Ott. Web Caching Framework: Analytical Models and Beyond. In IEEE Workshop on Internet Applications, 1999.

[9] K. Sripanidkulchai. The Popularity of Gnutella Queries and Its Implications on Scalability. http://www2.cs.cmu.edu/~kunwadee/research/p2p/gnutella.html

[10] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of IEEE INFOCOM, 1999.

[11] L. Liu, C. Pu, and W. Tang. Continual Queries for Internet-Scale Event-Driven Information Delivery. IEEE Transactions on Knowledge and Data Engineering, Special Issue on Web Technology, 1999.

[12] L. Liu, C. Pu, and W. Tang. WebCQ: Detecting and Delivering Information Changes on the Web. In Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM), Washington, D.C., 2000.

[13] M. A. Kelly and R. E. Seviora. An Evaluation of DRETE on CUPID for OPS5 Matching. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), pp. 84-90, Detroit, MI, August 1989.

[14] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra. Matching Events in a Content-Based Subscription System. In Proceedings of the 18th ACM Symposium on Principles of Distributed Computing (PODC), 1999.

[15] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing High Availability Using Lazy Replication. ACM Transactions on Computer Systems, 10(4):360-391, November 1992.
