Histogram to estimate result set of $\sigma_{A=c}(R)$
If a histogram is available for the attribute A, the number of tuples can be estimated with more accuracy. The range to which the value c belongs is first located in the histogram.
$|B|$: number of values per bucket (number of distinct values appearing in that range)
$\#B$: number of records in the bucket
$T(\sigma_{A=c}(R)) = \frac{\#B}{|B|}$
Histogram to estimate result set of $\sigma_{A=c}(R)$
Example: R(A,B,C) is a relation with $T(R) = 10{,}000$ and $V(R,A) = 50$. Estimate $T(\sigma_{A=10}(R))$.
The DBMS has collected the following equi-width histogram on A:

range            [1,10]  [11,20]  [21,30]  [31,40]  [41,50]
tuples in range  50      2000     2000     3000     2950

The value 10 falls in the bucket [1,10], so
$T(\sigma_{A=10}(R)) = \frac{\#B}{|B|} = \frac{50}{10} = 5$
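The example can be reproduced with a short Python sketch. The function name `estimate_equality` and the `(low, high, count)` bucket layout are illustrative choices, not part of the slides, and the bucket size |B| is taken to be the width of the integer range, which assumes that every integer in the range occurs in A.

```python
# Equi-width histogram on R.A from the example: (low, high, tuples in range).
HISTOGRAM = [
    (1, 10, 50),
    (11, 20, 2000),
    (21, 30, 2000),
    (31, 40, 3000),
    (41, 50, 2950),
]


def estimate_equality(histogram, c):
    """Estimate T(sigma_{A=c}(R)) as #B / |B| for the bucket that contains c."""
    for low, high, count in histogram:
        if low <= c <= high:
            # Assumption: |B| equals the number of integers in the range.
            distinct_values = high - low + 1
            return count / distinct_values
    return 0.0  # c lies outside every bucket


def estimate_uniform(total_tuples, distinct_values):
    """Estimate without a histogram: T(R) / V(R,A)."""
    return total_tuples / distinct_values


print(estimate_equality(HISTOGRAM, 10))  # 5.0
print(estimate_uniform(10_000, 50))      # 200.0
```

Without the histogram, the uniform estimate T(R) / V(R,A) gives 200 tuples for the same selection, which shows how much a skewed distribution can mislead the simpler formula.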
Join Size using Histograms
$R \bowtie S$
Use: $T(R \bowtie S) = \frac{T(R) \times T(S)}{\max(V(R,A), V(S,A))}$
Apply this formula to each bucket.
Join Size using Histograms
Within each bucket: $V(R,A) = V(S,A) =$ bucket size $|B|$
$T(R \bowtie S) = \sum_{\text{buckets}} \frac{\#B(R) \times \#B(S)}{|B|}$
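A minimal Python sketch of the per-bucket join estimate follows. It assumes both relations have histograms on A with identical bucket boundaries, and that |B| equals the width of the integer range. `HIST_R` reuses the histogram from the selection example; `HIST_S` is made-up data for a hypothetical relation S with T(S) = 500.

```python
# (low, high, tuples in range) per bucket, same boundaries for R and S.
HIST_R = [(1, 10, 50), (11, 20, 2000), (21, 30, 2000), (31, 40, 3000), (41, 50, 2950)]
HIST_S = [(1, 10, 400), (11, 20, 25), (21, 30, 25), (31, 40, 25), (41, 50, 25)]  # hypothetical


def estimate_join_size(hist_r, hist_s):
    """Sum #B(R) * #B(S) / |B| over all buckets."""
    if len(hist_r) != len(hist_s):
        raise ValueError("histograms must have the same number of buckets")
    total = 0.0
    for (low_r, high_r, count_r), (low_s, high_s, count_s) in zip(hist_r, hist_s):
        if (low_r, high_r) != (low_s, high_s):
            raise ValueError("histograms must use the same bucket boundaries")
        bucket_size = high_r - low_r + 1  # |B|, assuming every integer occurs
        total += count_r * count_s / bucket_size
    return total


print(estimate_join_size(HIST_R, HIST_S))  # 26875.0
print(10_000 * 500 / max(50, 50))          # 100000.0, estimate without histograms
```

With this made-up S, most S tuples fall in the bucket where R has very few tuples, so the per-bucket estimate is much smaller than the one obtained from T(R), T(S), and V alone.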
Advanced Techniques
- Wavelets
- Approximate Histograms
- Sampling Techniques
- Compressed Histograms
Summary
As should be clear by now, result size estimation is not an exact art.
To estimate the size of the intermediate relations, we have used parameters like T(R) and V(R,A).
The DBMS keeps statistics from previous operations to be able to provide such parameters.
However, computing statistics is expensive, so they should be recomputed only periodically:
- statistics usually change little over a short time
- even inaccurate statistics are useful
- statistics recomputation might be triggered after some period of time or after some number of updates
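The last point can be illustrated with a small Python sketch of a recomputation trigger. The class `StatsRefreshPolicy`, its fields, and the threshold values are hypothetical and do not describe how any particular DBMS decides when to refresh its statistics.

```python
from dataclasses import dataclass


@dataclass
class StatsRefreshPolicy:
    """Hypothetical policy: refresh after enough updates or enough elapsed time."""
    max_updates: int          # updates tolerated before a refresh
    max_age_seconds: float    # maximum age of the statistics
    updates_since_refresh: int = 0
    last_refresh: float = 0.0

    def record_update(self, n=1):
        """Count n inserted, deleted, or modified tuples."""
        self.updates_since_refresh += n

    def needs_refresh(self, now):
        """Return True if either threshold has been reached at time `now`."""
        too_many_updates = self.updates_since_refresh >= self.max_updates
        too_old = (now - self.last_refresh) >= self.max_age_seconds
        return too_many_updates or too_old

    def mark_refreshed(self, now):
        """Reset both counters after the statistics were recomputed."""
        self.updates_since_refresh = 0
        self.last_refresh = now


policy = StatsRefreshPolicy(max_updates=1000, max_age_seconds=3600.0)
policy.record_update(1000)
print(policy.needs_refresh(now=10.0))  # True
```

Passing `now` explicitly keeps the sketch deterministic; a real system would read the clock itself.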
