slides01-4


Join Sizes

Sometimes, the size of a join result can be exponential in the size of the input relations, even if the join is acyclic.

Example 1

Consider $A_1A_2 \bowtie A_2A_3 \bowtie \cdots \bowtie A_{n-1}A_n$. Let each $A_i$ have domain $\{1, 2, 3, 4\}$. Let each relation consist of the eight tuples such that one component is odd, the other even. Then the join result consists of all tuples over $A_1A_2 \cdots A_n$ with alternating odd and even components, a total of $2^{n+1}$ tuples. Yet the sum of the sizes of the $n-1$ input relations is only $8(n-1)$.

Complexity of Acyclic Joins

The appropriate measure of "input" to the problem of computing a join is the sum of the sizes of the input relations and the result. Desirable: an algorithm that is polynomial in this "size."

If the join is acyclic, we can always compute the join in polynomial time. Start with a full reducer, which is polynomial in the sizes of the input relations. Then, join in any order. Since there are no globally dangling tuples, the join can only increase in size at each step, so each intermediate result is polynomial in the size of the output.

Complexity of Cyclic Joins

The bad news is that the computation of a cyclic join can be exponential in the sum of the input and output sizes.

Example 2

Consider the attributes and relations of Example 1, but include an additional relation $A_nA_1$ in the join, as:

$A_1A_2 \bowtie A_2A_3 \bowtie \cdots \bowtie A_{n-1}A_n \bowtie A_nA_1$

This join is clearly cyclic. If $n$ is odd, its result is empty, because the join of the first $n-1$ relations allows only sequences of odd (o) and even (e) numbers of the forms $oeoe \cdots oeo$ and $eoeo \cdots eoe$, while the last term $A_nA_1$ requires $A_n$ and $A_1$ to have values of different parity.

The full reduction doesn't change the relations, because every attribute of every relation has $\{1, 2, 3, 4\}$ in its column.

No matter how we group the relations for a join, before the final join there will be a relation formed from at least $(n+1)/2$ of the given relations.
This intermediate relation has at least $2^{(n+5)/2}$ tuples, so it surely takes time exponential in $n$ to compute.

Computing the Projection of a Join

Things only get worse when what we want is not the join of relations, but some projection of that join, e.g., $\pi_{ACE}(AB \bowtie BCD \bowtie DE)$.

This form appears commonly in queries, but there is usually enough selection applied to the relations before joining that there is an efficient query plan. Not so when the "query" is really the definition of a materialized view. Then the joined relations are often entire base tables, and the exponentiality of the problem is real.

Projections of Acyclic Joins: Yannakakis' Algorithm

The key idea is to use the "parse tree" implicit in a GYO reduction to guide the order of joins. The first step is to fully reduce the input relations. During the join phase, project out all unnecessary attributes (those not in the final projection and not needed in any future join) after each join step. Intermediate relations are no larger than the product of the input and output sizes.

Example 3

Consider the acyclic join-projection:

$\pi_{AG}(ABC \bowtie BF \bowtie BCD \bowtie CDE \bowtie DEG)$

[Figure: the acyclic hypergraph of this join, with nodes A, B, C, D, E, F, G and hyperedges ABC, BF, BCD, CDE, DEG.]

Parse Trees

When we perform a GYO reduction, we may construct a parse tree as follows. Tree nodes correspond to hyperedges. The children of tree node $H$ are all those hyperedges consumed by $H$.

We choose as a join order one in which each node is joined with its parent, in some bottom-up order (i.e., do not join a node into its parent until all its children have been joined into it).

After each join into a relation $R$, project the result onto the set of attributes that are either in the schema of $R$ or on the projection list.
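The join-then-project step described above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the helper names `natural_join`, `project`, and `join_into_parent` are hypothetical, relations are modeled as lists of dicts, and the sample rows are the BCD and BF instances used in Example 4.

```python
def natural_join(r, s):
    """Natural join of two relations, each a list of dicts (attribute -> value)."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[a] == u[a] for a in shared):
                out.append({**t, **u})
    return out


def project(r, attrs):
    """Project onto attrs, dropping duplicate tuples (set semantics)."""
    seen = set()
    out = []
    for t in r:
        key = tuple(sorted((a, t[a]) for a in attrs if a in t))
        if key not in seen:
            seen.add(key)
            out.append(dict(key))
    return out


def join_into_parent(parent, child, parent_schema, project_list):
    """Join a child into its parent, then keep only the attributes that are
    in the parent's schema or on the projection list."""
    keep = set(parent_schema) | set(project_list)
    return project(natural_join(parent, child), keep)


# Illustrative data: the BCD and BF instances of Example 4.
bcd = [
    {"B": "b1", "C": "c1", "D": "d1"},
    {"B": "b1", "C": "c1", "D": "d2"},
]
bf = [
    {"B": "b1", "F": "f1"},
    {"B": "b1", "F": "f2"},
]

# F is not on the projection list AG, so it is projected away again.
result = join_into_parent(bcd, bf, parent_schema="BCD", project_list="AG")
print(result)
```

The nested-loop join is chosen only for brevity; a hash join would serve equally well, since the point here is the projection applied immediately after each join.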
Example 4

Here is one possible parse tree for the join hypergraph of Example 3. CDE is the root; its children are BCD and DEG; the children of BCD are ABC and BF:

    CDE
      BCD
        ABC
        BF
      DEG

Here are example relation instances, which are already fully reduced:

    CDE         BCD         ABC         DEG         BF
    C  D  E     B  C  D     A  B  C     D  E  G     B  F
    c1 d1 e1    b1 c1 d1    a1 b1 c1    d1 e1 g1    b1 f1
    c1 d2 e1    b1 c1 d2    a2 b1 c1    d1 e1 g2    b1 f2
                                        d2 e1 g1

We may join in any bottom-up order. Suppose we first join BF into BCD. We get a relation with 4 tuples. However, F is not in the projection list {A, G}, so we project this relation onto BCD again, leaving the same relation for BCD. The result is that BF has been eliminated, with no other changes:

    CDE         BCD         DEG         ABC
    C  D  E     B  C  D     D  E  G     A  B  C
    c1 d1 e1    b1 c1 d1    d1 e1 g1    a1 b1 c1
    c1 d2 e1    b1 c1 d2    d1 e1 g2    a2 b1 c1
                            d2 e1 g1

We next choose to join ABC into BCD. Since A appears in the projection list, it is retained at the node for BCD, which now has schema ABCD, as:

    CDE         ABCD           DEG
    C  D  E     A  B  C  D     D  E  G
    c1 d1 e1    a1 b1 c1 d1    d1 e1 g1
    c1 d2 e1    a1 b1 c1 d2    d1 e1 g2
                a2 b1 c1 d1    d2 e1 g1
                a2 b1 c1 d2

Suppose we next join ABCD into CDE. We must project out B, since it is neither an attribute of CDE nor on the project list. However, A remains because it is on the project list:

    ACDE           DEG
    A  C  D  E     D  E  G
    a1 c1 d1 e1    d1 e1 g1
    a1 c1 d2 e1    d1 e1 g2
    a2 c1 d1 e1    d2 e1 g1
    a2 c1 d2 e1

Last, we join DEG into ACDE. Attribute G remains, because it is in the project list:

    ACDEG
    A  C  D  E  G
    a1 c1 d1 e1 g1
    a1 c1 d1 e1 g2
    a1 c1 d2 e1 g1
    a2 c1 d1 e1 g1
    a2 c1 d1 e1 g2
    a2 c1 d2 e1 g1

Our final step is to project the resulting relation ACDEG onto AG, which gives a final result consisting of the four tuples {a1g1, a1g2, a2g1, a2g2}.

Why Yannakakis' Algorithm is Polynomial

Consider a relation $R$ at some node, which at some time during the algorithm has been replaced by $\pi_{R \cup Y}(R \bowtie S_1 \bowtie S_2 \bowtie \cdots \bowtie S_k)$, where:

1. $S_1, \ldots, S_k$ are some of the relations descending from $R$ in the parse tree.
2. $Y$ is the set of attributes on the project list that are not in $R$ but in at least one of $S_1, \ldots, S_k$.

Tricky point: No attributes other than those of $R$ and the output attributes ever need to be in the schema of $R$.
The proof depends on a number of ideas we haven't had:

Before there was GYO reduction, the full-reducer theorem was proven using a different definition of acyclicity. A hypergraph was said to be acyclic if its hyperedges could be mapped to nodes, and the nodes placed in a parse tree, so that for each attribute $A$, the nodes with $A$ in their schema formed a subtree (not necessarily at the root).

One can prove that this definition of "acyclic" is equivalent to the GYO-based definition we use today.

Thus, if $A$ is an attribute at a child of $R$, but not at $R$, $A$ cannot appear at any ancestor of $R$ and is thus not needed after we join that child with $R$.

In what follows, we count the "size" of a relation instance as the number of its tuples. Technically, we need to consider the number of components of tuples as well, but since the set of relations and their schemas may be considered fixed, we are ignoring constant factors only.

Let $T = R \bowtie S_1 \bowtie S_2 \bowtie \cdots \bowtie S_k$. Then $\pi_{R \cup Y}(T) \subseteq \pi_R(T) \times \pi_Y(T)$. To see why, notice that $R$ and $Y$ are disjoint sets of attributes. As always, we're using $R$ as both an instance and a schema, where appropriate.

Because the relations are fully reduced, joins only increase in size, and no tuple in an intermediate join can be dangling. Thus:

1. $\pi_Y(T)$ is no larger than $\pi_L(T)$, where $L$ is the entire project list for the query.
2. $\pi_L(T)$ is no larger than the output, since every tuple of $T$ extends to at least one tuple in the join of all relations.
3. Putting 1 and 2 together: $\pi_Y(T)$ is no larger than the output!

$\pi_R(T) = R$ (again, because the relations are fully reduced), so $\pi_R(T)$ is surely no bigger than the input.

We conclude that $\pi_{R \cup Y}(T)$ is no bigger than the product of the input and output, i.e., polynomial in the input + output sizes.

Final step: the number of computations of polynomial-sized relations is a constant, depending only on the schemas and not on the instances. ...
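As a sanity check on the size argument, the bottom-up walk of Example 4 can be replayed in Python and every intermediate relation compared against the input-times-output bound. This is only a sketch under stated assumptions: the helpers `rel`, `join`, `project`, and `join_into` are hypothetical names, relations are modeled as sets of frozensets of (attribute, value) pairs, and schemas are written as strings of one-letter attribute names.

```python
def rel(schema, rows):
    """Build a relation from a schema string such as 'CDE' and rows of values."""
    return {frozenset(zip(schema, row)) for row in rows}


def join(r, s):
    """Natural join of two relations."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out


def project(r, attrs):
    """Projection onto attrs with duplicate elimination."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


# Fully reduced instances from Example 4.
CDE = rel("CDE", [("c1", "d1", "e1"), ("c1", "d2", "e1")])
BCD = rel("BCD", [("b1", "c1", "d1"), ("b1", "c1", "d2")])
ABC = rel("ABC", [("a1", "b1", "c1"), ("a2", "b1", "c1")])
DEG = rel("DEG", [("d1", "e1", "g1"), ("d1", "e1", "g2"), ("d2", "e1", "g1")])
BF = rel("BF", [("b1", "f1"), ("b1", "f2")])

PROJECT_LIST = set("AG")


def join_into(parent, child, parent_schema):
    """Join child into parent; keep the parent's own attributes plus output attributes."""
    keep = set(parent_schema) | PROJECT_LIST
    return project(join(parent, child), keep)


# Same bottom-up order as the example: BF, then ABC, into BCD; then BCD and DEG into CDE.
intermediates = []
bcd = join_into(BCD, BF, "BCD")
intermediates.append(bcd)
bcd = join_into(bcd, ABC, "BCD")   # schema is now ABCD
intermediates.append(bcd)
cde = join_into(CDE, bcd, "CDE")   # B is projected out, schema ACDE
intermediates.append(cde)
cde = join_into(cde, DEG, "CDE")   # schema ACDEG
intermediates.append(cde)

output = project(cde, PROJECT_LIST)

input_size = sum(len(x) for x in (CDE, BCD, ABC, DEG, BF))
bound = input_size * len(output)
print([len(x) for x in intermediates], len(output), bound)
```

On this instance the intermediate relations stay well under the bound, which is what the argument predicts; a single small instance illustrates the claim but of course does not prove it.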