4 Query Flocks
Goal: apply the a-priori trick and other association-rule tricks to a more general class of complex queries.

4.1 Query Flock Notation

A query flock is a generate-and-test system consisting of:
1. A query with parameters; we write the query in Datalog to simplify certain optimizations later.
2. A filter condition that says when the values of the parameters yield a query result that we accept.
Note that the query flock is really a single query about its parameters; the parametrized-query component is not the real query.

Example 4.1: Frequent item pairs in a relation Baskets(BID, item) can be written as the query flock:
    Answer(b) :- Baskets(b,$1) AND Baskets(b,$2)
    COUNT(Answer) >= s

If we replace parameters $1 and $2 by values, e.g., "diapers" and "beer," respectively, then the query asks for the set of basket IDs such that the basket contains both diapers and beer. The condition on the answer says that there must be at least s such baskets, where s is the support threshold. Thus, this query flock asks the usual question about the parameters $1 and $2: "which pairs of items appear in at least s baskets?" □

Example 4.2: Here is a less usual example. It supposes relations:
1. Cust(name, attr, value). Tuple (n, a, v) means the customer with name n has value v for attribute a. For instance, (Sue, age, 45) means that Sue is of age 45.
2. Buys(name, prod) tells what products each customer buys.
3. Type(prod, type) tells the type of each product, e.g., product "Coke" is of type "soft drink."
Here is the query flock that asks for values of some attribute that occur at least s times among buyers of a certain type of product:
    Answer(n) :- Cust(n,$a,$v) AND Buys(n,p) AND Type(p,$t)
    COUNT(Answer) >= s
□
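As a rough illustration of the generate-and-test reading of the flock in Example 4.2, the Python sketch below evaluates it directly: it builds the Answer set for every parameter assignment ($a, $v, $t) and keeps the assignments whose Answer set has at least s members. The function name flock_example_4_2, the encoding of relations as sets of tuples, and all sample data other than the (Sue, age, 45) tuple are assumptions made for illustration; they are not part of the notes.

```python
from collections import defaultdict

def flock_example_4_2(cust, buys, type_, s):
    """Naive evaluation of the Example 4.2 flock (hypothetical helper).

    cust:  set of (name, attr, value) tuples
    buys:  set of (name, prod) tuples
    type_: set of (prod, type) tuples
    Returns the set of ($a, $v, $t) with COUNT(Answer) >= s.
    """
    products_of = defaultdict(set)   # name -> products that customer buys
    for n, p in buys:
        products_of[n].add(p)
    types_of = defaultdict(set)      # product -> its types
    for p, t in type_:
        types_of[p].add(t)

    # answer[(a, v, t)] is the set Answer(n) for that parameter assignment.
    answer = defaultdict(set)
    for n, a, v in cust:
        for p in products_of[n]:
            for t in types_of[p]:
                answer[(a, v, t)].add(n)

    # Filter condition of the flock: COUNT(Answer) >= s.
    return {params for params, names in answer.items() if len(names) >= s}

# Made-up sample data, for illustration only.
cust = {("Sue", "age", 45), ("Bob", "age", 45), ("Ann", "age", 30),
        ("Sue", "city", "Davis")}
buys = {("Sue", "Coke"), ("Bob", "Pepsi"), ("Ann", "Coke"), ("Bob", "chips")}
type_ = {("Coke", "soft drink"), ("Pepsi", "soft drink"), ("chips", "snack")}

accepted = flock_example_4_2(cust, buys, type_, s=2)
print(accepted)
```

With this sample data and s = 2, only the assignment $a = age, $v = 45, $t = soft drink is supported by two distinct customers (Sue and Bob).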

4.2 Execution Strategies

The analog of a-priori is the observation that if we delete one or more subgoals from a Datalog query, the size of the set of answers can only increase. Our hope is that by computing some temporary relations using a subset of the subgoals, we can filter the sets of values for one or more parameters, using computations that are much less expensive than computing the entire query about the full set of parameters. We can describe the intermediate steps, as well as the final computation of the parameter values that pass the test, by a sequence of steps of the form
    Relation := FILTER(parameters, query, condition)

The query is the flock query, with zero or more subgoals eliminated. A requirement is that this query be safe; i.e., every variable appearing in the head appears in a nonnegated subgoal involving a relation (i.e., not a subgoal involving an arithmetic comparison like a < b). The parameters are those appearing in the query. The condition is the same as the condition of the flock itself.

Example 4.3: The flock of Example 4.1 might be solved by using the first subgoal to filter $1 and the second subgoal to filter $2.
    OK1($1) := FILTER({$1}, Answer(b) :- Baskets(b,$1), COUNT(Answer) >= s)
    OK2($2) := FILTER({$2}, Answer(b) :- Baskets(b,$2), COUNT(Answer) >= s)
    OK($1,$2) := FILTER({$1,$2}, Answer(b) :- Baskets(b,$1) AND Baskets(b,$2) AND OK1($1) AND OK2($2), COUNT(Answer) >= s)

Of course, a clever flocks compiler recognizes that the first two filtering steps are really the same and computes only one of OK1 and OK2. A-priori often saves a lot of time because the join of four relations at the last step (the computation of OK($1,$2)) can be carried out in an order that reduces the size of intermediate relations, compared with just joining Baskets with itself, as suggested by the ordering of Fig. 6.
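The three FILTER steps of Example 4.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the notes' implementation: relations are sets of (BID, item) tuples, the function names frequent_pairs_naive and frequent_pairs_plan are hypothetical, and the sample baskets are made up. Because the flock has no condition such as $1 < $2, both versions return ordered pairs and include pairs with $1 = $2.

```python
from collections import defaultdict

def frequent_pairs_naive(baskets, s):
    """Evaluate the Example 4.1 flock by joining Baskets with itself."""
    items_in = defaultdict(set)              # BID -> items in that basket
    for b, item in baskets:
        items_in[b].add(item)
    count = defaultdict(int)                 # ($1, $2) -> number of baskets
    for items in items_in.values():
        for i in items:
            for j in items:
                count[(i, j)] += 1
    return {pair for pair, c in count.items() if c >= s}

def frequent_pairs_plan(baskets, s):
    """Evaluate the same flock using the filter steps OK1, OK2, OK."""
    # OK1($1): items that appear in at least s baskets.
    baskets_of = defaultdict(set)            # item -> BIDs containing it
    for b, item in baskets:
        baskets_of[item].add(b)
    ok1 = {item for item, bs in baskets_of.items() if len(bs) >= s}
    ok2 = ok1                                # the two filters are identical

    # Reduce each copy of Baskets by its filter before joining the copies.
    left = {(b, i) for b, i in baskets if i in ok1}
    right = defaultdict(set)                 # BID -> surviving items
    for b, j in baskets:
        if j in ok2:
            right[b].add(j)

    # Final step: join the two reduced relations on b and apply the test.
    answer = defaultdict(set)                # ($1, $2) -> Answer(b)
    for b, i in left:
        for j in right[b]:
            answer[(i, j)].add(b)
    return {pair for pair, bs in answer.items() if len(bs) >= s}

# Made-up sample data, for illustration only.
baskets = {("b1", "diapers"), ("b1", "beer"), ("b1", "milk"),
           ("b2", "diapers"), ("b2", "beer"),
           ("b3", "diapers"), ("b3", "milk"),
           ("b4", "beer"), ("b4", "chips")}

pairs = frequent_pairs_plan(baskets, s=2)
print(sorted(pairs))
```

Both functions return the same set; the plan simply discards infrequent items (here "chips") before the expensive self-join, which is where the savings would come from on realistic data.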

                 JOIN
               /      \
           JOIN        JOIN
          /    \      /    \
     Baskets   OK1  Baskets  OK2

Figure 6: Preferred order for join in market-basket flock

Notice that the ordering in Fig. 6 is not a left-deep ordering, which suggests that the typical commercial DBMS would not find this order, and a query-flocks compiler needs to feed simpler queries to the DBMS so the right order of join is used by the DBMS. □

Example 4.4: Now let us consider how we might use filter steps to improve the running time of the final
join in Example 4.2. Using just the Cust subgoal is a filter on {$a, $v}, but there is no useful filter for just one of these parameters. We cannot use:
    Answer(n) :- Type(p,$t)
to filter $t, because the query is not safe (n appears in the head but not the body). However,
    Answer(n) :- Buys(n,p) AND Type(p,$t)
is safe and may be used. A possible plan for optimizing this query flock is in Fig. 7. Figure 8 shows the preferred join order for the final step. □

    OK1($a,$v) := FILTER({$a,$v}, Answer(n) :- Cust(n,$a,$v), COUNT(Answer) >= s)
    OK2($t) := FILTER({$t}, Answer(n) :- Buys(n,p) AND Type(p,$t), COUNT(Answer) >= s)
    OK($a,$v,$t) := FILTER({$a,$v,$t}, Answer(n) :- Cust(n,$a,$v) AND Buys(n,p) AND Type(p,$t) AND OK1($a,$v) AND OK2($t), COUNT(Answer) >= s)

Figure 7: Query flock plan for Example 4.2

Figure 8: Join order for final step in Fig. 7 (a join tree over Buys, OK1, Cust, Type, and OK2)
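The safety requirement that rules out the first candidate filter in Example 4.4 can be checked mechanically. The sketch below is one possible encoding, assuming a subgoal is written as a (relation, arguments, negated) triple, that parameters are the arguments starting with "$", and that an arithmetic comparison is marked by relation None; the function name is_safe and this encoding are illustrative assumptions, not notation from the notes.

```python
def is_safe(head_vars, subgoals):
    """Return True if every head variable appears in some nonnegated
    subgoal that involves a relation (hypothetical encoding).

    head_vars: list of variable names in the head, e.g. ["n"]
    subgoals:  list of (relation, args, negated); relation is None for an
               arithmetic comparison such as a < b
    """
    bound = set()
    for relation, args, negated in subgoals:
        if relation is not None and not negated:
            # Parameters ($...) are not variables, so they are skipped.
            bound.update(a for a in args if not a.startswith("$"))
    return all(v in bound for v in head_vars)

# Answer(n) :- Type(p,$t)                 -- n never appears in the body
type_only = is_safe(["n"], [("Type", ("p", "$t"), False)])

# Answer(n) :- Buys(n,p) AND Type(p,$t)   -- n is bound by Buys
buys_and_type = is_safe(["n"], [("Buys", ("n", "p"), False),
                                ("Type", ("p", "$t"), False)])

# Answer(n) :- Cust(n,$a,$v)              -- n is bound by Cust
cust_only = is_safe(["n"], [("Cust", ("n", "$a", "$v"), False)])

print(type_only, buys_and_type, cust_only)
```

Under this encoding the check agrees with the example: the Type-only query is rejected, while the Cust-only query and the Buys-and-Type query are accepted and can serve as the filter steps OK1 and OK2 of Fig. 7.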
This note was uploaded on 01/31/2011 for the course CS 345 taught by Professor Dunbar,a during the Fall '07 term at UC Davis.