DATABASE MANAGEMENT SYSTEMS SOLUTIONS MANUAL
Raghu Ramakrishnan et al.
University of Wisconsin Madison, WI, USA
CONTENTS

PREFACE
1 INTRODUCTION TO DATABASE SYSTEMS
2 THE ENTITY-RELATIONSHIP MODEL
3 THE RELATIONAL MODEL
4 RELATIONAL ALGEBRA AND CALCULUS
5 SQL: QUERIES, PROGRAMMING, TRIGGERS
6 QUERY-BY-EXAMPLE (QBE)
7 STORING DATA: DISKS AND FILES
8 FILE ORGANIZATIONS AND INDEXES
9 TREE-STRUCTURED INDEXING
10 HASH-BASED INDEXING
11 EXTERNAL SORTING
12 EVALUATION OF RELATIONAL OPERATORS
13 INTRODUCTION TO QUERY OPTIMIZATION
14 A TYPICAL QUERY OPTIMIZER
15 SCHEMA REFINEMENT AND NORMAL FORMS
16 PHYSICAL DATABASE DESIGN AND TUNING
17 SECURITY
18 TRANSACTION MANAGEMENT OVERVIEW
19 CONCURRENCY CONTROL
20 CRASH RECOVERY
21 PARALLEL AND DISTRIBUTED DATABASES
PREFACE
It is not every question that deserves an answer. Publius Syrus, 42 B.C.
I hope that most of the questions in this book deserve an answer. The set of questions is unusually extensive, and is designed to reinforce and deepen students’ understanding of the concepts covered in each chapter. There is a strong emphasis on quantitative and problem-solving type exercises.

Answers to almost all chapter exercises are included in this solutions manual for Chapters 1 through 19. Solutions for Chapters 20 through 22 are currently unavailable.

While I wrote some of the solutions myself, most were written originally by students in the database classes at Wisconsin. I’d like to thank the many students who helped in developing and checking the solutions to the exercises; this manual would not be available without their contributions. In alphabetical order: X. Bao, S. Biao, M. Chakrabarti, C. Chan, W. Chen, N. Cheung, D. Colwell, C. Fritz, V. Ganti, J. Gehrke, G. Glass, V. Gopalakrishnan, M. Higgins, T. Jasmin, M. Krishnaprasad, Y. Lin, C. Liu, M. Lusignan, H. Modi, S. Narayanan, D. Randolph, A. Ranganathan, J. Reminga, A. Therber, M. Thomas, Q. Wang, R. Wang, Z. Wang and J. Yuan. In addition, James Harrington and Martin Reames at Wisconsin and Nina Tang at Berkeley provided especially detailed feedback. Several students contributed to each chapter’s solutions, and answers were subsequently checked by me and by other students.

This manual has been in use for several semesters. I hope that it is now mostly accurate, but I’m sure it still contains errors and omissions. If you are a student and you do not understand a particular solution, contact your instructor; it may be that you are missing something, but it may also be that the solution is incorrect! If you discover a bug, please send me mail (raghu@cs.wisc.edu) and I will update the manual promptly. The latest version of this solutions manual is distributed freely through the Web; go to the home page mentioned below to obtain a copy.
For More Information
The home page for this book is at URL:

http://www.cs.wisc.edu/~dbbook

This page is frequently updated and contains information about the book, past and current users, and the software. This page also contains a link to all known errors in the book, the accompanying slides, and the software. Since the solutions manual is distributed electronically, all known errors are immediately fixed and no list of errors is maintained. Instructors are advised to visit this site periodically; they can also register at this site to be notified of important changes by email.
1
INTRODUCTION TO DATABASE SYSTEMS
Exercise 1.1 Why would you choose a database system instead of simply storing data in operating system files? When would it make sense not to use a database system?

Answer 1.1 A database is an integrated collection of data, usually so large that it has to be stored on secondary storage devices such as disks or tapes. This data can be maintained as a collection of operating system files, or stored in a DBMS (database management system). The advantages of using a DBMS are:

Data independence and efficient access. Database application programs are independent of the details of data representation and storage. The conceptual and external schemas provide independence from physical storage decisions and logical design decisions respectively. In addition, a DBMS provides efficient storage and retrieval mechanisms, including support for very large files, index structures and query optimization.

Reduced application development time. Since the DBMS provides several important functions required by applications, such as concurrency control and crash recovery, high level query facilities, etc., only application-specific code needs to be written. Even this is facilitated by suites of application development tools available from vendors for many database management systems.

Data integrity and security. The view mechanism and the authorization facilities of a DBMS provide a powerful access control mechanism. Further, updates to the data that violate the semantics of the data can be detected and rejected by the DBMS if users specify the appropriate integrity constraints.

Data administration. By providing a common umbrella for a large collection of data that is shared by several users, a DBMS facilitates maintenance and data administration tasks. A good DBA can effectively shield end-users from the chores of fine-tuning the data representation, periodic back-ups etc.
Concurrent access and crash recovery. A DBMS supports the notion of a transaction, which is conceptually a single user’s sequential program. Users can write transactions as if their programs were running in isolation against the database. The DBMS executes the actions of transactions in an interleaved fashion to obtain good performance, but schedules them in such a way as to ensure that conﬂicting operations are not permitted to proceed concurrently. Further, the DBMS maintains a continuous log of the changes to the data, and if there is a system crash, it can restore the database to a transaction-consistent state. That is, the actions of incomplete transactions are undone, so that the database state reﬂects only the actions of completed transactions. Thus, if each complete transaction, executing alone, maintains the consistency criteria, then the database state after recovery from a crash is consistent.
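The atomicity and recovery behavior described above can be sketched with Python's built-in sqlite3 module standing in for a full DBMS; the accounts table, the transfer, and the simulated crash are invented for illustration.

```python
# A minimal sketch of transaction atomicity and rollback, with Python's
# built-in sqlite3 standing in for a full DBMS. The accounts table, the
# transfer, and the simulated crash are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# A transfer is a single transaction: both updates happen, or neither does.
try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated crash before the matching credit")
except RuntimeError:
    conn.rollback()  # the incomplete transaction's actions are undone

# The database reflects only the actions of completed transactions.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

After the rollback, the debit to alice is undone, which is exactly the transaction-consistent state described above.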
If these advantages are not important for the application at hand, using a collection of files may be a better solution because of the increased cost and overhead of purchasing and maintaining a DBMS.

Exercise 1.2 What is logical data independence and why is it important?

Answer 1.2 Answer omitted.

Exercise 1.3 Explain the difference between logical and physical data independence.

Answer 1.3 Logical data independence means that users are shielded from changes in the logical structure of the data, while physical data independence insulates users from changes in the physical storage of the data. We saw an example of logical data independence in the answer to Exercise 1.2. Consider the Students relation from that example (and now assume that it is not replaced by the two smaller relations). We could choose to store Students tuples in a heap file, with a clustered index on the sname field. Alternatively, we could choose to store it with an index on the gpa field, or to create indexes on both fields, or to store it as a file sorted by gpa. These storage alternatives are not visible to users, except in terms of improved performance, since they simply see a relation as a set of tuples. This is what is meant by physical data independence.

Exercise 1.4 Explain the difference between external, internal, and conceptual schemas. How are these different schema layers related to the concepts of logical and physical data independence?

Answer 1.4 Answer omitted.

Exercise 1.5 What are the responsibilities of a DBA? If we assume that the DBA is never interested in running his or her own queries, does the DBA still need to understand query optimization? Why?
Answer 1.5 The DBA is responsible for:

Designing the logical and physical schemas, as well as widely-used portions of the external schema.

Security and authorization.

Data availability and recovery from failures.

Database tuning: The DBA is responsible for evolving the database, in particular the conceptual and physical schemas, to ensure adequate performance as user requirements change.

A DBA needs to understand query optimization even if s/he is not interested in running his or her own queries, because some of these responsibilities (database design and tuning) are related to query optimization. Unless the DBA understands the performance needs of widely used queries, and how the DBMS will optimize and execute these queries, good design and tuning decisions cannot be made.
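As a concrete sketch of why tuning decisions depend on the optimizer, the sqlite3 module's EXPLAIN QUERY PLAN shows how the same query gets a different access plan once an index exists. The Students table and the idx_gpa index are invented names, and the exact plan wording varies across SQLite versions.

```python
# A sketch, using Python's built-in sqlite3, of what a DBA weighs when
# tuning: the same query gets a different access plan once an index exists.
# The Students table and idx_gpa index are invented, and the exact plan
# wording varies across SQLite versions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, sname TEXT, gpa REAL)")

query = "SELECT * FROM Students WHERE gpa > 3.5"

# Without an index on gpa, the optimizer has to scan the whole table.
plan_scan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# After the DBA adds an index, the optimizer can search it instead.
conn.execute("CREATE INDEX idx_gpa ON Students (gpa)")
plan_index = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_scan[0][-1])   # e.g. SCAN Students
print(plan_index[0][-1])  # e.g. SEARCH Students USING INDEX idx_gpa (gpa>?)
```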
Exercise 1.6 Scrooge McNugget wants to store information (names, addresses, descriptions of embarrassing moments, etc.) about the many ducks on his payroll. Not surprisingly, the volume of data compels him to buy a database system. To save money, he wants to buy one with the fewest possible features, and he plans to run it as a stand-alone application on his PC clone. Of course, Scrooge does not plan to share his list with anyone. Indicate which of the following DBMS features Scrooge should pay for; in each case also indicate why Scrooge should (or should not) pay for that feature in the system he buys.

1. A security facility.
2. Concurrency control.
3. Crash recovery.
4. A view mechanism.
5. A query language.

Answer 1.6 Answer omitted.
Exercise 1.7 Which of the following plays an important role in representing information about the real world in a database? Explain briefly.

1. The data definition language.
2. The data manipulation language.
3. The buffer manager.
4. The data model.

Answer 1.7 Let us discuss the choices in turn.

The data definition language is important in representing information because it is used to describe external and logical schemas.

The data manipulation language is used to access and update data; it is not important for representing the data. (Of course, the data manipulation language must be aware of how data is represented, and reflects this in the constructs that it supports.)

The buffer manager is not very important.

The data model is fundamental to representing information. The data model determines what data representation mechanisms are supported by the DBMS. The data definition language is just the specific set of language constructs available to describe an actual application’s data in terms of the data model.
Exercise 1.8 Describe the structure of a DBMS. If your operating system is upgraded to support some new functions on OS files (e.g., the ability to force some sequence of bytes to disk), which layer(s) of the DBMS would you have to rewrite in order to take advantage of these new functions?

Answer 1.8 Answer omitted.
Exercise 1.9 Answer the following questions:

1. What is a transaction?
2. Why does a DBMS interleave the actions of different transactions, instead of executing transactions one after the other?
3. What must a user guarantee with respect to a transaction and database consistency? What should a DBMS guarantee with respect to concurrent execution of several transactions and database consistency?
4. Explain the strict two-phase locking protocol.
5. What is the WAL property, and why is it important?
2
THE ENTITY-RELATIONSHIP MODEL
Exercise 2.1 Explain the following terms briefly: attribute, domain, entity, relationship, entity set, relationship set, one-to-many relationship, many-to-many relationship, participation constraint, overlap constraint, covering constraint, weak entity set, aggregation, and role indicator.

Answer 2.1 No answer provided yet.
Exercise 2.2 A university database contains information about professors (identified by social security number, or SSN) and courses (identified by courseid). Professors teach courses; each of the following situations concerns the Teaches relationship set. For each situation, draw an ER diagram that describes it (assuming that no further constraints hold).

1. Professors can teach the same course in several semesters, and each offering must be recorded.
2. Professors can teach the same course in several semesters, and only the most recent such offering needs to be recorded. (Assume this condition applies in all subsequent questions.)
3. Every professor must teach some course.
4. Every professor teaches exactly one course (no more, no less).
5. Every professor teaches exactly one course (no more, no less), and every course must be taught by some professor.
6. Now suppose that certain courses can be taught by a team of professors jointly, but it is possible that no one professor in a team can teach the course. Model this situation, introducing additional entity sets and relationship sets if necessary.

Answer 2.2 Answer omitted.
Exercise 2.3 Consider the following information about a university database:

Professors have an SSN, a name, an age, a rank, and a research specialty.

Projects have a project number, a sponsor name (e.g., NSF), a starting date, an ending date, and a budget.

Graduate students have an SSN, a name, an age, and a degree program (e.g., M.S. or Ph.D.).

Each project is managed by one professor (known as the project’s principal investigator).

Each project is worked on by one or more professors (known as the project’s co-investigators).

Professors can manage and/or work on multiple projects.

Each project is worked on by one or more graduate students (known as the project’s research assistants).

When graduate students work on a project, a professor must supervise their work on the project. Graduate students can work on multiple projects, in which case they will have a (potentially different) supervisor for each one.

Departments have a department number, a department name, and a main office.

Departments have a professor (known as the chairman) who runs the department.

Professors work in one or more departments, and for each department that they work in, a time percentage is associated with their job.

Graduate students have one major department in which they are working on their degree.

Each graduate student has another, more senior graduate student (known as a student advisor) who advises him or her on what courses to take.

Design and draw an ER diagram that captures the information about the university. Use only the basic ER model here, that is, entities, relationships, and attributes. Be sure to indicate any key and participation constraints.

Answer 2.3 The ER diagram is shown in Figure 2.1.
Figure 2.1 ER Diagram for Exercise 2.3 (diagram omitted; its entity sets include Professor, project, Dept, and Graduate, with relationship sets Manages, work_in, Work_dept, Work_proj, Runs, Supervises, Major, and Advisor)

Exercise 2.4 A company database needs to store information about employees (identified by ssn, with salary and phone as attributes); departments (identified by dno, with dname and budget as attributes); and children of employees (with name and age as attributes). Employees work in departments; each department is managed by an employee; a child must be identified uniquely by name when the parent (who is an employee; assume that only one parent works for the company) is known. We are not interested in information about a child once the parent leaves the company.

Draw an ER diagram that captures this information.

Answer 2.4 Answer omitted.
Exercise 2.5 Notown Records has decided to store information about musicians who perform on its albums (as well as other company data) in a database. The company has wisely chosen to hire you as a database designer (at your usual consulting fee of $2,500/day). The following information describes the situation that the Notown database must model.

Each musician that records at Notown has an SSN, a name, an address, and a phone number. Poorly paid musicians often share the same address, and no address has more than one phone.

Each instrument that is used in songs recorded at Notown has a name (e.g., guitar, synthesizer, flute) and a musical key (e.g., C, B-flat, E-flat).

Each album that is recorded on the Notown label has a title, a copyright date, a format (e.g., CD or MC), and an album identifier.

Each song recorded at Notown has a title and an author.

Each musician may play several instruments, and a given instrument may be played by several musicians.

Each album has a number of songs on it, but no song may appear on more than one album.

Each song is performed by one or more musicians, and a musician may perform a number of songs.

Each album has exactly one musician who acts as its producer. A musician may produce several albums, of course.

Design a conceptual schema for Notown and draw an ER diagram for your schema. Be sure to indicate all key and cardinality constraints and any assumptions that you make. Identify any constraints that you are unable to capture in the ER diagram and briefly explain why you could not express them.

Answer 2.5 The ER diagram is shown in Figure 2.2.
Figure 2.2 ER Diagram for Exercise 2.5 (diagram omitted; its entity sets include Musicians, Instrument, Album, Songs, Place, and Telephone, with relationship sets Plays, Perform, Appears, Producer, Lives, and Home)
Exercise 2.6 Computer Sciences Department frequent fliers have been complaining to Dane County Airport officials about the poor organization at the airport. As a result, the officials have decided that all information related to the airport should be organized using a DBMS, and you’ve been hired to design the database. Your first task is to organize the information about all the airplanes that are stationed and maintained at the airport. The relevant information is as follows:

Every airplane has a registration number, and each airplane is of a specific model.

The airport accommodates a number of airplane models, and each model is identified by a model number (e.g., DC-10) and has a capacity and a weight.

A number of technicians work at the airport. You need to store the name, SSN, address, phone number, and salary of each technician.

Each technician is an expert on one or more plane model(s), and his or her expertise may overlap with that of other technicians. This information about technicians must also be recorded.

Traffic controllers must have an annual medical examination. For each traffic controller, you must store the date of the most recent exam.

All airport employees (including technicians) belong to a union. You must store the union membership number of each employee. You can assume that each employee is uniquely identified by the social security number.

The airport has a number of tests that are used periodically to ensure that airplanes are still airworthy. Each test has a Federal Aviation Administration (FAA) test number, a name, and a maximum possible score.

The FAA requires the airport to keep track of each time that a given airplane is tested by a given technician using a given test. For each testing event, the information needed is the date, the number of hours the technician spent doing the test, and the score that the airplane received on the test.

1. Draw an ER diagram for the airport database. Be sure to indicate the various attributes of each entity and relationship set; also specify the key and participation constraints for each relationship set. Specify any necessary overlap and covering constraints as well (in English).

2. The FAA passes a regulation that tests on a plane must be conducted by a technician who is an expert on that model. How would you express this constraint in the ER diagram? If you cannot express it, explain briefly.

Answer 2.6 Answer omitted.
Exercise 2.7 The Prescriptions-R-X chain of pharmacies has offered to give you a free lifetime supply of medicines if you design its database. Given the rising cost of health care, you agree. Here’s the information that you gather:

Patients are identified by an SSN, and their names, addresses, and ages must be recorded.

Doctors are identified by an SSN. For each doctor, the name, specialty, and years of experience must be recorded.

Each pharmaceutical company is identified by name and has a phone number.

For each drug, the trade name and formula must be recorded. Each drug is sold by a given pharmaceutical company, and the trade name identifies a drug uniquely from among the products of that company. If a pharmaceutical company is deleted, you need not keep track of its products any longer.

Each pharmacy has a name, address, and phone number.

Every patient has a primary physician. Every doctor has at least one patient.

Each pharmacy sells several drugs and has a price for each. A drug could be sold at several pharmacies, and the price could vary from one pharmacy to another.

Doctors prescribe drugs for patients. A doctor could prescribe one or more drugs for several patients, and a patient could obtain prescriptions from several doctors. Each prescription has a date and a quantity associated with it. You can assume that if a doctor prescribes the same drug for the same patient more than once, only the last such prescription needs to be stored.

Pharmaceutical companies have long-term contracts with pharmacies. A pharmaceutical company can contract with several pharmacies, and a pharmacy can contract with several pharmaceutical companies. For each contract, you have to store a start date, an end date, and the text of the contract.

Pharmacies appoint a supervisor for each contract. There must always be a supervisor for each contract, but the contract supervisor can change over the lifetime of the contract.

1. Draw an ER diagram that captures the above information. Identify any constraints that are not captured by the ER diagram.

2. How would your design change if each drug must be sold at a fixed price by all pharmacies?

3. How would your design change if the design requirements change as follows: If a doctor prescribes the same drug for the same patient more than once, several such prescriptions may have to be stored.
Figure 2.3 ER Diagram for Exercise 2.7 (diagram omitted; its entity sets include Patient, Doctor, Pharmacy, Drug, and Pharm_co, with relationship sets including Pri_physician, Prescription, Sell, Contract, and Make)
Answer 2.7

1. The ER diagram is shown in Figure 2.3.
2. If the drug is to be sold at a fixed price we can add the price attribute to the Drug entity set and eliminate the Sell relationship set.

3. The date information can no longer be modeled as an attribute of Prescription. We have to create a new entity set called Prescription date and make Prescription a 4-way relationship set that involves this additional entity set.
Exercise 2.8 Although you always wanted to be an artist, you ended up being an expert on databases because you love to cook data and you somehow confused ‘data base’ with ‘data baste.’ Your old love is still there, however, so you set up a database company, ArtBase, that builds a product for art galleries. The core of this product is a database with a schema that captures all the information that galleries need to maintain.

Galleries keep information about artists, their names (which are unique), birthplaces, age, and style of art. For each piece of artwork, the artist, the year it was made, its unique title, its type of art (e.g., painting, lithograph, sculpture, photograph), and its price must be stored. Pieces of artwork are also classified into groups of various kinds, for example, portraits, still lifes, works by Picasso, or works of the 19th century; a given piece may belong to more than one group. Each group is identified by a name (like those above) that describes the group. Finally, galleries keep information about customers. For each customer, galleries keep their unique name, address, total amount of dollars they have spent in the gallery (very important!), and the artists and groups of art that each customer tends to like.

Draw the ER diagram for the database.

Answer 2.8 Answer omitted.
3
THE RELATIONAL MODEL
Exercise 3.1 Define the following terms: relation schema, relational database schema, domain, relation instance, relation cardinality, and relation degree.

Answer 3.1 A relation schema can be thought of as the basic information describing a table or relation. This includes a set of column names, the data types associated with each column, and the name associated with the entire table. For example, a relation schema for the relation called Students could be expressed using the following representation:

Students(sid: string, name: string, login: string, age: integer, gpa: real)

There are five fields or columns, with names and types as shown above.

A relational database schema is a collection of relation schemas, describing one or more relations.

Domain is synonymous with data type. Attributes can be thought of as columns in a table. Therefore, an attribute domain refers to the data type associated with a column.

A relation instance is a set of tuples (also known as rows or records) that each conform to the schema of the relation.

The relation cardinality is the number of tuples in the relation.

The relation degree is the number of fields (or columns) in the relation.
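The last two definitions can be made concrete with a small sketch using Python's sqlite3 module; the two-tuple instance of Students is invented for illustration.

```python
# A small sketch of cardinality (number of tuples) versus degree (number
# of fields), using Python's sqlite3; the two-tuple instance is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students "
             "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
                 [("50000", "Dave", "dave@cs", 19, 3.3),
                  ("53666", "Jones", "jones@cs", 18, 3.4)])

# Cardinality: count the tuples. Degree: count the fields.
cardinality = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
degree = len(conn.execute("SELECT * FROM Students").description)
print(cardinality, degree)  # 2 5
```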
Exercise 3.2 How many distinct tuples are in a relation instance with cardinality 22?

Answer 3.2 Answer omitted.
Exercise 3.3 Does the relational model, as seen by an SQL query writer, provide physical and logical data independence? Explain.

Answer 3.3 The user of SQL has no idea how the data is physically represented in the machine. He or she relies entirely on the relation abstraction for querying. Physical data independence is therefore assured. Since a user can define views, logical data independence can also be achieved by using view definitions to hide changes in the conceptual schema.
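This use of views can be sketched with sqlite3, assuming a hypothetical redesign that splits Students into two smaller relations; all table names and data here are invented.

```python
# A sketch of logical data independence through views, assuming a
# hypothetical redesign that split Students into two smaller relations.
# sqlite3 stands in for the DBMS; all names and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE StudentNames (sid TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE StudentGrades (sid TEXT PRIMARY KEY, gpa REAL)")
conn.execute("INSERT INTO StudentNames VALUES ('53666', 'Jones')")
conn.execute("INSERT INTO StudentGrades VALUES ('53666', 3.4)")

# The view re-creates the old Students relation, so existing queries
# against the conceptual schema keep working unchanged.
conn.execute("CREATE VIEW Students AS "
             "SELECT N.sid, N.name, G.gpa "
             "FROM StudentNames N JOIN StudentGrades G ON N.sid = G.sid")

row = conn.execute("SELECT name, gpa FROM Students").fetchone()
print(row)  # ('Jones', 3.4)
```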
Exercise 3.4 What is the difference between a candidate key and the primary key for a given relation? What is a superkey?

Answer 3.4 Answer omitted.
Figure 3.1 An Instance S1 of the Students Relation

sid   | name    | login         | age | gpa
------+---------+---------------+-----+-----
50000 | Dave    | dave@cs       | 19  | 3.3
53666 | Jones   | jones@cs      | 18  | 3.4
53688 | Smith   | smith@ee      | 18  | 3.2
53650 | Smith   | smith@math    | 19  | 3.8
53831 | Madayan | madayan@music | 11  | 1.8
53832 | Guldu   | guldu@music   | 12  | 2.0

(The columns are the fields, also called attributes; the rows are the tuples, also called records.)
Exercise 3.5 Consider the instance of the Students relation shown in Figure 3.1.

1. Give an example of an attribute (or set of attributes) that you can deduce is not a candidate key, based on this instance being legal.

2. Is there any example of an attribute (or set of attributes) that you can deduce is a candidate key, based on this instance being legal?

Answer 3.5 Examples of non-candidate keys include the following: {name}, {age}. (Note that {gpa} cannot be declared a non-candidate key from this evidence alone, even though common sense tells us that clearly more than one student could have the same grade point average.)

You cannot determine a key of a relation given only one instance of the relation. The fact that the instance is “legal” is immaterial. A candidate key, as defined here, is a key, not something that only might be a key. The instance shown is just one possible “snapshot” of the relation. At other times, the same relation may have an instance (or snapshot) that contains a totally different set of tuples, and we cannot make predictions about those instances based only upon the instance that we are given.

Exercise 3.6 What is a foreign key constraint? Why are such constraints important? What is referential integrity?

Answer 3.6 Answer omitted.

Exercise 3.7 Consider the relations Students, Faculty, Courses, Rooms, Enrolled, Teaches, and Meets_In that were defined in Section 1.5.2.

1. List all the foreign key constraints among these relations.

2. Give an example of a (plausible) constraint involving one or more of these relations that is not a primary key or foreign key constraint.

Answer 3.7 There is no reason for a foreign key constraint (FKC) on the Students, Faculty, Courses, or Rooms relations. These are the most basic relations and must be free-standing. Special care must be given to entering data into these base relations. In the Enrolled relation, sid and cid should both have FKCs placed on them. (Real students must be enrolled in real courses.) Also, since real teachers must teach real courses, both the fid and the cid fields in the Teaches relation should have FKCs. Finally, Meets_In should place FKCs on both the cid and rno fields.

It would probably be wise to enforce a few other constraints on this DBMS: the length of sid, cid, and fid could be standardized; checksums could be added to these identification numbers; limits could be placed on the size of the numbers entered into the credits, capacity, and salary fields; an enumerated type should be assigned to the grade field (preventing a student from receiving a grade of G, among other things); etc.

Exercise 3.8 Answer each of the following questions briefly.
The questions are based on the following relational schema:

Emp(eid: integer, ename: string, age: integer, salary: real)
Works(eid: integer, did: integer, pct_time: integer)
Dept(did: integer, dname: string, budget: real, managerid: integer)

1. Give an example of a foreign key constraint that involves the Dept relation. What are the options for enforcing this constraint when a user attempts to delete a Dept tuple?

2. Write the SQL statements required to create the above relations, including appropriate versions of all primary and foreign key integrity constraints.

3. Define the Dept relation in SQL so that every department is guaranteed to have a manager.

4. Write an SQL statement to add ‘John Doe’ as an employee with eid = 101, age = 32 and salary = 15,000.

5. Write an SQL statement to give every employee a 10% raise.

6. Write an SQL statement to delete the ‘Toy’ department. Given the referential integrity constraints you chose for this schema, explain what happens when this statement is executed.

Answer 3.8 Answer omitted.
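Two of the constraints suggested in Answer 3.7, a foreign key on Enrolled.sid and an enumerated grade field, can be sketched with sqlite3. A CHECK constraint stands in for an enumerated type; table names follow Section 1.5.2, the data is invented, and SQLite enforces foreign keys only after an explicit PRAGMA.

```python
# A sketch of constraints from Answer 3.7 in sqlite3: a foreign key on
# Enrolled.sid, and a CHECK constraint standing in for an enumerated grade
# type. The data is invented; SQLite needs PRAGMA foreign_keys = ON.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Enrolled ("
             " sid TEXT REFERENCES Students,"
             " cid TEXT,"
             " grade TEXT CHECK (grade IN ('A', 'B', 'C', 'D', 'F')))")

conn.execute("INSERT INTO Students VALUES ('53666', 'Jones')")
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'Carnatic101', 'A')")

rejected = []
for bad in [("99999", "Reggae203", "B"),     # unknown sid: violates the FKC
            ("53666", "Topology112", "G")]:  # 'G' is not a legal grade
    try:
        conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", bad)
    except sqlite3.IntegrityError:
        rejected.append(bad)
print(len(rejected))  # 2
```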
Figure 3.2 Students with age < 18 on Instance S1

sid   | name    | login         | age | gpa
------+---------+---------------+-----+-----
53831 | Madayan | madayan@music | 11  | 1.8
53832 | Guldu   | guldu@music   | 12  | 2.0
Exercise 3.9 Consider the SQL query whose answer is shown in Figure 3.2.

1. Modify this query so that only the login column is included in the answer.

2. If the clause WHERE S.gpa >= 2 is added to the original query, what is the set of tuples in the answer?

Answer 3.9 The answers are as follows:

1. Only login is included in the answer:

SELECT login
FROM Students S
WHERE S.age < 18

2. The answer tuple for Madayan is omitted.
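Both parts can be checked mechanically by loading instance S1 (Figure 3.1) into an in-memory SQLite database via Python's sqlite3 module; part 2 is shown here on the login-only variant of the query.

```python
# Checking both parts of Answer 3.9 by loading instance S1 (Figure 3.1)
# into an in-memory SQLite database with Python's sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students "
             "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("50000", "Dave",    "dave@cs",       19, 3.3),
    ("53666", "Jones",   "jones@cs",      18, 3.4),
    ("53688", "Smith",   "smith@ee",      18, 3.2),
    ("53650", "Smith",   "smith@math",    19, 3.8),
    ("53831", "Madayan", "madayan@music", 11, 1.8),
    ("53832", "Guldu",   "guldu@music",   12, 2.0)])

# Part 1: only the login column in the answer.
logins = [r[0] for r in conn.execute(
    "SELECT login FROM Students S WHERE S.age < 18")]
print(logins)  # ['madayan@music', 'guldu@music']

# Part 2: adding S.gpa >= 2 omits the tuple for Madayan (gpa 1.8).
remaining = conn.execute(
    "SELECT login FROM Students S WHERE S.age < 18 AND S.gpa >= 2").fetchall()
print(remaining)  # [('guldu@music',)]
```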
Exercise 3.10 Explain why the addition of NOT NULL constraints to the SQL deﬁnition of the Manages relation (in Section 3.5.3) would not enforce the constraint that each department must have a manager. What, if anything, is achieved by requiring that the ssn ﬁeld of Manages be non-null?
Answer 3.10 Answer omitted.
Exercise 3.11 Suppose that we have a ternary relationship R between entity sets A, B, and C such that A has a key constraint and total participation and B has a key constraint; these are the only constraints. A has attributes a1 and a2, with a1 being the key; B and C are similar. R has no descriptive attributes. Write SQL statements that create tables corresponding to this information so as to capture as many of the constraints as possible. If you cannot capture some constraint, explain why. Answer 3.11 The following SQL statement creates Table A: CREATE TABLE A ( a1 CHAR(10), a2 CHAR(10), PRIMARY KEY (a1) )
Tables B and C are similar to A.

CREATE TABLE R ( a1 CHAR(10),
                 b1 CHAR(10) NOT NULL,
                 c1 CHAR(10),
                 PRIMARY KEY (a1),
                 UNIQUE (b1),
                 FOREIGN KEY (a1) REFERENCES A,
                 FOREIGN KEY (b1) REFERENCES B,
                 FOREIGN KEY (c1) REFERENCES C )
We cannot capture the total participation constraint of A in R. This is because we cannot ensure that every key value a1 of A appears in R without the use of CHECK constraints or assertions.
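The gap can be demonstrated concretely: with foreign keys enforced, nothing stops an A tuple that participates in no R tuple. A minimal sketch using Python's sqlite3 (types mapped to TEXT for the SQLite dialect; the row value 'a-orphan' is invented):

```python
import sqlite3

# The CREATE TABLE statements capture the key constraints on R, but
# foreign keys alone cannot force every A tuple to appear in R.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE A (a1 TEXT PRIMARY KEY, a2 TEXT)")
conn.execute("CREATE TABLE B (b1 TEXT PRIMARY KEY, b2 TEXT)")
conn.execute("CREATE TABLE C (c1 TEXT PRIMARY KEY, c2 TEXT)")
conn.execute("""CREATE TABLE R (
    a1 TEXT PRIMARY KEY REFERENCES A,
    b1 TEXT NOT NULL UNIQUE REFERENCES B,
    c1 TEXT REFERENCES C)""")

# Accepted without complaint, even though it violates total participation:
conn.execute("INSERT INTO A VALUES ('a-orphan', 'x')")

# The violation has to be detected by a query instead of a constraint:
orphans = [r[0] for r in conn.execute(
    "SELECT a1 FROM A WHERE a1 NOT IN (SELECT a1 FROM R)")]
```

A CHECK constraint or assertion would have to run this kind of cross-table query, which is exactly what plain key and foreign key declarations cannot express.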
Exercise 3.12 Consider the scenario from Exercise 2.2 where you designed an ER diagram for a university database. Write SQL statements to create the corresponding relations and capture as many of the constraints as possible. If you cannot capture some constraints, explain why. Answer 3.12 Answer omitted.
Exercise 3.13 Consider the university database from Exercise 2.3 and the ER diagram that you designed. Write SQL statements to create the corresponding relations and capture as many of the constraints as possible. If you cannot capture some constraints, explain why.
The Relational Model
Answer 3.13 Answer omitted.
Exercise 3.14 Consider the scenario from Exercise 2.4 where you designed an ER diagram for a company database. Write SQL statements to create the corresponding relations and capture as many of the constraints as possible. If you cannot capture some constraints, explain why.

Answer 3.14 Answer omitted.

Exercise 3.15 Consider the Notown database from Exercise 2.4. You have decided to recommend that Notown use a relational database system to store company data. Show the SQL statements for creating relations corresponding to the entity sets and relationship sets in your design. Identify any constraints in the ER diagram that you are unable to capture in the SQL statements and briefly explain why you could not express them.

Answer 3.15 The following SQL statements create the corresponding relations.

1. CREATE TABLE Musicians ( ssn CHAR(10),
                            name CHAR(30),
                            PRIMARY KEY (ssn) )

2. CREATE TABLE Instruments ( instrId CHAR(10),
                              dname CHAR(30),
                              key CHAR(5),
                              PRIMARY KEY (instrId) )

3. CREATE TABLE Plays ( ssn CHAR(10),
                        instrId CHAR(10),
                        PRIMARY KEY (ssn, instrId),
                        FOREIGN KEY (ssn) REFERENCES Musicians,
                        FOREIGN KEY (instrId) REFERENCES Instruments )
4. CREATE TABLE Songs_Appears ( songId INTEGER,
                                author CHAR(30),
                                title CHAR(30),
                                albumIdentifier INTEGER NOT NULL,
                                PRIMARY KEY (songId),
                                FOREIGN KEY (albumIdentifier) REFERENCES Album_Producer )
5. CREATE TABLE Telephone_Home ( phone CHAR(11),
                                 address CHAR(30),
                                 PRIMARY KEY (phone),
                                 FOREIGN KEY (address) REFERENCES Place )

6. CREATE TABLE Lives ( ssn CHAR(10),
                        phone CHAR(11),
                        address CHAR(30),
                        PRIMARY KEY (ssn, address),
                        FOREIGN KEY (phone, address) REFERENCES Telephone_Home,
                        FOREIGN KEY (ssn) REFERENCES Musicians )

7. CREATE TABLE Place ( address CHAR(30) )

8. CREATE TABLE Perform ( songId INTEGER,
                          ssn CHAR(10),
                          PRIMARY KEY (ssn, songId),
                          FOREIGN KEY (songId) REFERENCES Songs_Appears,
                          FOREIGN KEY (ssn) REFERENCES Musicians )

9. CREATE TABLE Album_Producer ( albumIdentifier INTEGER,
                                 ssn CHAR(10) NOT NULL,
                                 copyrightDate DATE,
                                 speed INTEGER,
                                 title CHAR(30),
                                 PRIMARY KEY (albumIdentifier),
                                 FOREIGN KEY (ssn) REFERENCES Musicians )
Exercise 3.16 Translate your ER diagram from Exercise 2.6 into a relational schema, and show the SQL statements needed to create the relations, using only key and null constraints. If your translation cannot capture any constraints in the ER diagram, explain why. In Exercise 2.6, you also modiﬁed the ER diagram to include the constraint that tests on a plane must be conducted by a technician who is an expert on that model. Can you modify the SQL statements deﬁning the relations obtained by mapping the ER diagram to check this constraint? Answer 3.16 Answer omitted.
Exercise 3.17 Consider the ER diagram that you designed for the Prescriptions-R-X chain of pharmacies in Exercise 2.7. Define relations corresponding to the entity sets and relationship sets in your design using SQL.

Answer 3.17 The statements to create tables corresponding to entity sets Doctor, Pharmacy, and Pharm_co are straightforward and omitted. The other required tables can be created as follows:

1. CREATE TABLE Pri_Phy_Patient ( ssn CHAR(11),
                                  name CHAR(20),
                                  age INTEGER,
                                  address CHAR(20),
                                  phy_ssn CHAR(11),
                                  PRIMARY KEY (ssn),
                                  FOREIGN KEY (phy_ssn) REFERENCES Doctor )

2. CREATE TABLE Prescription ( ssn CHAR(11),
                               phy_ssn CHAR(11),
                               date CHAR(11),
                               quantity INTEGER,
                               trade_name CHAR(20),
                               pharm_id CHAR(11),
                               PRIMARY KEY (ssn, phy_ssn),
                               FOREIGN KEY (ssn) REFERENCES Pri_Phy_Patient,
                               FOREIGN KEY (phy_ssn) REFERENCES Doctor,
                               FOREIGN KEY (trade_name, pharm_id) REFERENCES Make_Drug )

3. CREATE TABLE Make_Drug ( trade_name CHAR(20),
                            pharm_id CHAR(11),
                            PRIMARY KEY (trade_name, pharm_id),
                            FOREIGN KEY (trade_name) REFERENCES Drug,
                            FOREIGN KEY (pharm_id) REFERENCES Pharm_co )

4. CREATE TABLE Sell ( price INTEGER,
                       name CHAR(10),
                       trade_name CHAR(10),
                       PRIMARY KEY (name, trade_name),
                       FOREIGN KEY (name) REFERENCES Pharmacy,
                       FOREIGN KEY (trade_name) REFERENCES Drug )
5. CREATE TABLE Contract ( name CHAR(20),
                           pharm_id CHAR(11),
                           start_date CHAR(11),
                           end_date CHAR(11),
                           text CHAR(10000),
                           supervisor CHAR(20),
                           PRIMARY KEY (name, pharm_id),
                           FOREIGN KEY (name) REFERENCES Pharmacy,
                           FOREIGN KEY (pharm_id) REFERENCES Pharm_co )
Exercise 3.18 Write SQL statements to create the corresponding relations to the ER diagram you designed for Exercise 2.8. If your translation cannot capture any constraints in the ER diagram, explain why. Answer 3.18 Answer omitted.
4
RELATIONAL ALGEBRA AND CALCULUS
Exercise 4.1 Explain the statement that relational algebra operators can be composed. Why is the ability to compose operators important?

Answer 4.1 Every operator in relational algebra accepts one or more relation instances as arguments, and the result is always a relation instance. So the argument of one operator can be the result of another operator. This is important because it makes it easy to write complex queries by simply composing relational algebra operators.

Exercise 4.2 Given two relations R1 and R2, where R1 contains N1 tuples, R2 contains N2 tuples, and N2 > N1 > 0, give the minimum and maximum possible sizes (in tuples) for the result relation produced by each of the following relational algebra expressions. In each case, state any assumptions about the schemas for R1 and R2 that are needed to make the expression meaningful: (1) R1 ∪ R2, (2) R1 ∩ R2, (3) R1 − R2, (4) R1 × R2, (5) σa=5(R1), (6) πa(R1), and (7) R1/R2.

Answer 4.2 Answer omitted.

Exercise 4.3 Consider the following schema:

Suppliers(sid: integer, sname: string, address: string)
Parts(pid: integer, pname: string, color: string)
Catalog(sid: integer, pid: integer, cost: real)

The key fields are underlined, and the domain of each field is listed after the field name. Thus sid is the key for Suppliers, pid is the key for Parts, and sid and pid together form the key for Catalog. The Catalog relation lists the prices charged for parts by Suppliers. Write the following queries in relational algebra, tuple relational calculus, and domain relational calculus:
1. Find the names of suppliers who supply some red part.

2. Find the sids of suppliers who supply some red or green part.

3. Find the sids of suppliers who supply some red part or are at 221 Packer Ave.

4. Find the sids of suppliers who supply some red part and some green part.

5. Find the sids of suppliers who supply every part.

6. Find the sids of suppliers who supply every red part.

7. Find the sids of suppliers who supply every red or green part.

8. Find the sids of suppliers who supply every red part or supply every green part.

9. Find pairs of sids such that the supplier with the first sid charges more for some part than the supplier with the second sid.

10. Find the pids of parts that are supplied by at least two different suppliers.

11. Find the pids of the most expensive parts supplied by suppliers named Yosemite Sham.

12. Find the pids of parts supplied by every supplier at less than $200. (If any supplier either does not supply the part or charges more than $200 for it, the part is not selected.)

Answer 4.3 In the answers below RA refers to Relational Algebra, TRC refers to Tuple Relational Calculus and DRC refers to Domain Relational Calculus.
1. RA
       πsname(πsid((πpid(σcolor='red' Parts)) ⋈ Catalog) ⋈ Suppliers)

   TRC
       {T | ∃T1 ∈ Suppliers(∃X ∈ Parts(X.color = 'red' ∧ ∃Y ∈ Catalog
       (Y.pid = X.pid ∧ Y.sid = T1.sid)) ∧ T.sname = T1.sname)}

   DRC
       {⟨Y⟩ | ⟨X, Y, Z⟩ ∈ Suppliers ∧ ∃P, Q, R(⟨P, Q, R⟩ ∈ Parts
       ∧ R = 'red' ∧ ∃I, J, K(⟨I, J, K⟩ ∈ Catalog ∧ J = P ∧ I = X))}

   SQL
       SELECT S.sname
       FROM Suppliers S, Parts P, Catalog C
       WHERE P.color = 'red' AND C.pid = P.pid AND C.sid = S.sid

2. RA
       πsid((πpid(σcolor='red' ∨ color='green' Parts)) ⋈ Catalog)

   TRC
       {T | ∃T1 ∈ Catalog(∃X ∈ Parts((X.color = 'red' ∨ X.color = 'green')
       ∧ X.pid = T1.pid) ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃A, B, C(⟨A, B, C⟩ ∈ Parts
       ∧ (C = 'red' ∨ C = 'green') ∧ A = Y)}

   SQL
       SELECT C.sid
       FROM Catalog C, Parts P
       WHERE (P.color = 'red' OR P.color = 'green') AND P.pid = C.pid

3. RA
       ρ(R1, πsid((πpid(σcolor='red' Parts)) ⋈ Catalog))
       ρ(R2, πsid(σaddress='221 Packer Street' Suppliers))
       R1 ∪ R2

   TRC
       {T | ∃T1 ∈ Catalog(∃X ∈ Parts(X.color = 'red' ∧ X.pid = T1.pid)
       ∧ T.sid = T1.sid) ∨ ∃T2 ∈ Suppliers(T2.address = '221 Packer Street'
       ∧ T.sid = T2.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃A, B, C(⟨A, B, C⟩ ∈ Parts
       ∧ C = 'red' ∧ A = Y) ∨ ∃P, Q(⟨X, P, Q⟩ ∈ Suppliers
       ∧ Q = '221 Packer Street')}

   SQL
       SELECT S.sid
       FROM Suppliers S
       WHERE S.address = '221 Packer Street'
       OR S.sid IN ( SELECT C.sid
                     FROM Parts P, Catalog C
                     WHERE P.color = 'red' AND P.pid = C.pid )
4. RA
       ρ(R1, πsid((πpid(σcolor='red' Parts)) ⋈ Catalog))
       ρ(R2, πsid((πpid(σcolor='green' Parts)) ⋈ Catalog))
       R1 ∩ R2

   TRC
       {T | ∃T1 ∈ Catalog(∃X ∈ Parts(X.color = 'red' ∧ X.pid = T1.pid)
       ∧ ∃T2 ∈ Catalog(∃Y ∈ Parts(Y.color = 'green' ∧ Y.pid = T2.pid)
       ∧ T2.sid = T1.sid) ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃A, B, C(⟨A, B, C⟩ ∈ Parts
       ∧ C = 'red' ∧ A = Y) ∧ ∃P, Q, R(⟨P, Q, R⟩ ∈ Catalog
       ∧ ∃E, F, G(⟨E, F, G⟩ ∈ Parts ∧ G = 'green' ∧ E = Q) ∧ P = X)}

   SQL
       SELECT C.sid
       FROM Parts P, Catalog C
       WHERE P.color = 'red' AND P.pid = C.pid
       AND EXISTS ( SELECT P2.pid
                    FROM Parts P2, Catalog C2
                    WHERE P2.color = 'green' AND C2.sid = C.sid
                          AND P2.pid = C2.pid )

5. RA
       (πsid,pid Catalog) / (πpid Parts)

   TRC
       {T | ∃T1 ∈ Catalog(∀X ∈ Parts(∃T2 ∈ Catalog
       (T2.pid = X.pid ∧ T2.sid = T1.sid)) ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∀⟨A, B, C⟩ ∈ Parts
       (∃⟨P, Q, R⟩ ∈ Catalog(Q = A ∧ P = X))}

   SQL
       SELECT C.sid
       FROM Catalog C
       WHERE NOT EXISTS (SELECT P.pid
                         FROM Parts P
                         WHERE NOT EXISTS (SELECT C1.sid
                                           FROM Catalog C1
                                           WHERE C1.sid = C.sid
                                                 AND C1.pid = P.pid))

6. RA
       (πsid,pid Catalog) / (πpid σcolor='red' Parts)

   TRC
       {T | ∃T1 ∈ Catalog(∀X ∈ Parts(X.color ≠ 'red'
       ∨ ∃T2 ∈ Catalog(T2.pid = X.pid ∧ T2.sid = T1.sid))
       ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∀⟨A, B, C⟩ ∈ Parts
       (C ≠ 'red' ∨ ∃⟨P, Q, R⟩ ∈ Catalog(Q = A ∧ P = X))}

   SQL
       SELECT C.sid
       FROM Catalog C
       WHERE NOT EXISTS (SELECT P.pid
                         FROM Parts P
                         WHERE P.color = 'red'
                               AND (NOT EXISTS (SELECT C1.sid
                                                FROM Catalog C1
                                                WHERE C1.sid = C.sid
                                                      AND C1.pid = P.pid)))

7. RA
       (πsid,pid Catalog) / (πpid σcolor='red' ∨ color='green' Parts)

   TRC
       {T | ∃T1 ∈ Catalog(∀X ∈ Parts((X.color ≠ 'red' ∧ X.color ≠ 'green')
       ∨ ∃T2 ∈ Catalog(T2.pid = X.pid ∧ T2.sid = T1.sid)) ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∀⟨A, B, C⟩ ∈ Parts
       ((C ≠ 'red' ∧ C ≠ 'green') ∨ ∃⟨P, Q, R⟩ ∈ Catalog(Q = A ∧ P = X))}

   SQL
       SELECT C.sid
       FROM Catalog C
       WHERE NOT EXISTS (SELECT P.pid
                         FROM Parts P
                         WHERE (P.color = 'red' OR P.color = 'green')
                               AND (NOT EXISTS (SELECT C1.sid
                                                FROM Catalog C1
                                                WHERE C1.sid = C.sid
                                                      AND C1.pid = P.pid)))
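The division queries in answers 5 through 7 all use the same double-NOT-EXISTS pattern in SQL. A quick sanity check of answer 5 ("suppliers who supply every part") with Python's sqlite3 on an invented toy catalog, where supplier 1 supplies both parts and supplier 2 supplies only one:

```python
import sqlite3

# Toy instance (invented): two parts; supplier 1 covers both, supplier 2 doesn't.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Parts (pid INTEGER PRIMARY KEY, pname TEXT, color TEXT)")
conn.execute("CREATE TABLE Catalog (sid INTEGER, pid INTEGER, cost REAL)")
conn.executemany("INSERT INTO Parts VALUES (?, ?, ?)",
                 [(10, "bolt", "red"), (20, "nut", "green")])
conn.executemany("INSERT INTO Catalog VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (1, 20, 7.0), (2, 10, 4.0)])

# Double NOT EXISTS: keep sid if there is no part it fails to supply.
supply_all = [r[0] for r in conn.execute("""
    SELECT DISTINCT C.sid
    FROM Catalog C
    WHERE NOT EXISTS (SELECT P.pid
                      FROM Parts P
                      WHERE NOT EXISTS (SELECT 1
                                        FROM Catalog C1
                                        WHERE C1.sid = C.sid
                                              AND C1.pid = P.pid))""")]
```

Only supplier 1 survives the double negation, which is exactly the relational division (πsid,pid Catalog) / (πpid Parts).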
8. RA
       ρ(R1, (πsid,pid Catalog) / (πpid σcolor='red' Parts))
       ρ(R2, (πsid,pid Catalog) / (πpid σcolor='green' Parts))
       R1 ∪ R2

   TRC
       {T | ∃T1 ∈ Catalog((∀X ∈ Parts(X.color ≠ 'red'
       ∨ ∃Y ∈ Catalog(Y.pid = X.pid ∧ Y.sid = T1.sid))
       ∨ ∀Z ∈ Parts(Z.color ≠ 'green' ∨ ∃P ∈ Catalog
       (P.pid = Z.pid ∧ P.sid = T1.sid))) ∧ T.sid = T1.sid)}

   DRC
       {⟨X⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ (∀⟨A, B, C⟩ ∈ Parts
       (C ≠ 'red' ∨ ∃⟨P, Q, R⟩ ∈ Catalog(Q = A ∧ P = X))
       ∨ ∀⟨U, V, W⟩ ∈ Parts(W ≠ 'green' ∨ ∃⟨M, N, L⟩ ∈ Catalog
       (N = U ∧ M = X)))}

   SQL
       SELECT C.sid
       FROM Catalog C
       WHERE (NOT EXISTS (SELECT P.pid
                          FROM Parts P
                          WHERE P.color = 'red'
                                AND (NOT EXISTS (SELECT C1.sid
                                                 FROM Catalog C1
                                                 WHERE C1.sid = C.sid
                                                       AND C1.pid = P.pid))))
       OR (NOT EXISTS (SELECT P1.pid
                       FROM Parts P1
                       WHERE P1.color = 'green'
                             AND (NOT EXISTS (SELECT C2.sid
                                              FROM Catalog C2
                                              WHERE C2.sid = C.sid
                                                    AND C2.pid = P1.pid))))

9. RA
       ρ(R1, Catalog)
       ρ(R2, Catalog)
       πR1.sid,R2.sid(σR1.pid=R2.pid ∧ R1.sid≠R2.sid ∧ R1.cost>R2.cost(R1 × R2))

   TRC
       {T | ∃T1 ∈ Catalog(∃T2 ∈ Catalog(T2.pid = T1.pid ∧ T2.sid ≠ T1.sid
       ∧ T2.cost < T1.cost ∧ T.sid2 = T2.sid) ∧ T.sid1 = T1.sid)}

   DRC
       {⟨X, P⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃P, Q, R(⟨P, Q, R⟩ ∈ Catalog
       ∧ Q = Y ∧ P ≠ X ∧ R < Z)}

   SQL
       SELECT C1.sid, C2.sid
       FROM Catalog C1, Catalog C2
       WHERE C1.pid = C2.pid AND C1.sid <> C2.sid
             AND C1.cost > C2.cost
10. RA
        ρ(R1, Catalog)
        ρ(R2, Catalog)
        πR1.pid(σR1.pid=R2.pid ∧ R1.sid≠R2.sid(R1 × R2))

    TRC
        {T | ∃T1 ∈ Catalog(∃T2 ∈ Catalog(T2.pid = T1.pid ∧ T2.sid ≠ T1.sid)
        ∧ T.pid = T1.pid)}

    DRC
        {⟨Y⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃A, B, C(⟨A, B, C⟩ ∈ Catalog
        ∧ B = Y ∧ A ≠ X)}

    SQL
        SELECT C.pid
        FROM Catalog C
        WHERE EXISTS (SELECT C1.sid
                      FROM Catalog C1
                      WHERE C1.pid = C.pid AND C1.sid <> C.sid)

11. RA
        ρ(R1, πsid(σsname='Yosemite Sham' Suppliers))
        ρ(R2, R1 ⋈ Catalog)
        ρ(R3, R2)
        ρ(R4(1 → sid, 2 → pid, 3 → cost), σR3.cost<R2.cost(R3 × R2))
        πpid(R2 − πsid,pid,cost R4)

    TRC
        {T | ∃T1 ∈ Catalog(∃X ∈ Suppliers(X.sname = 'Yosemite Sham'
        ∧ X.sid = T1.sid) ∧ ¬(∃S ∈ Suppliers(S.sname = 'Yosemite Sham'
        ∧ ∃Z ∈ Catalog(Z.sid = S.sid ∧ Z.cost > T1.cost)))
        ∧ T.pid = T1.pid)}

    DRC
        {⟨Y⟩ | ⟨X, Y, Z⟩ ∈ Catalog ∧ ∃A, B, C(⟨A, B, C⟩ ∈ Suppliers
        ∧ C = 'Yosemite Sham' ∧ A = X) ∧ ¬(∃P, Q, R(⟨P, Q, R⟩ ∈ Suppliers
        ∧ R = 'Yosemite Sham' ∧ ∃I, J, K(⟨I, J, K⟩ ∈ Catalog
        ∧ I = P ∧ K > Z)))}

    SQL
        SELECT C.pid
        FROM Catalog C, Suppliers S
        WHERE S.sname = 'Yosemite Sham' AND C.sid = S.sid
              AND C.cost >= ALL (SELECT C2.cost
                                 FROM Catalog C2, Suppliers S2
                                 WHERE S2.sname = 'Yosemite Sham'
                                       AND C2.sid = S2.sid)
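Query 10 ("pids supplied by at least two different suppliers") can also be expressed with grouping, and both forms agree on any instance. A sketch using Python's sqlite3 on invented data, comparing the self-join formulation against GROUP BY ... HAVING:

```python
import sqlite3

# Invented toy catalog: part 10 has two suppliers, part 20 only one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Catalog (sid INTEGER, pid INTEGER, cost REAL)")
conn.executemany("INSERT INTO Catalog VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (2, 10, 4.0), (1, 20, 7.0)])

# Self-join style: a second Catalog tuple with the same pid but a different sid.
self_join = sorted({r[0] for r in conn.execute("""
    SELECT C.pid
    FROM Catalog C
    WHERE EXISTS (SELECT 1
                  FROM Catalog C1
                  WHERE C1.pid = C.pid AND C1.sid <> C.sid)""")})

# Grouping style: count distinct suppliers per part.
grouped = sorted(r[0] for r in conn.execute(
    "SELECT pid FROM Catalog GROUP BY pid HAVING COUNT(DISTINCT sid) >= 2"))
```

The grouping form generalizes directly to "at least k suppliers", which the self-join form cannot do without k copies of Catalog.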
Exercise 4.4 Consider the Supplier-Parts-Catalog schema from the previous question. State what the following queries compute:

1. πsname(πsid(σcolor='red' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers)

2. πsname(πsid((σcolor='red' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers))

3. (πsname((σcolor='red' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers)) ∩
   (πsname((σcolor='green' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers))

4. (πsid((σcolor='red' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers)) ∩
   (πsid((σcolor='green' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers))

5. πsname((πsid,sname((σcolor='red' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers)) ∩
   (πsid,sname((σcolor='green' Parts) ⋈ (σcost<100 Catalog) ⋈ Suppliers)))

Answer 4.4 Answer not available yet.
Exercise 4.5 Consider the following relations containing airline flight information:

Flights(flno: integer, from: string, to: string, distance: integer, departs: time, arrives: time)
Aircraft(aid: integer, aname: string, cruisingrange: integer)
Certified(eid: integer, aid: integer)
Employees(eid: integer, ename: string, salary: integer)

Note that the Employees relation describes pilots and other kinds of employees as well; every pilot is certified for some aircraft (otherwise, he or she would not qualify as a pilot), and only pilots are certified to fly. Write the following queries in relational algebra, tuple relational calculus, and domain relational calculus. Note that some of these queries may not be expressible in relational algebra (and, therefore, also not expressible in tuple and domain relational calculus)! For such queries, informally explain why they cannot be expressed. (See the exercises at the end of Chapter 5 for additional queries over the airline schema.)
1. Find the eids of pilots certified for some Boeing aircraft.

2. Find the names of pilots certified for some Boeing aircraft.

3. Find the aids of all aircraft that can be used on non-stop flights from Bonn to Madras.

4. Identify the flights that can be piloted by every pilot whose salary is more than $100,000.

5. Find the names of pilots who can operate planes with a range greater than 3,000 miles but are not certified on any Boeing aircraft.

6. Find the eids of employees who make the highest salary.

7. Find the eids of employees who make the second highest salary.

8. Find the eids of employees who are certified for the largest number of aircraft.

9. Find the eids of employees who are certified for exactly three aircraft.

10. Find the total amount paid to employees as salaries.

11. Is there a sequence of flights from Madison to Timbuktu? Each flight in the sequence is required to depart from the city that is the destination of the previous flight; the first flight must leave Madison, the last flight must reach Timbuktu, and there is no restriction on the number of intermediate flights. Your query must determine whether a sequence of flights from Madison to Timbuktu exists for any input Flights relation instance.

Answer 4.5 In the answers below RA refers to Relational Algebra, TRC refers to Tuple Relational Calculus and DRC refers to Domain Relational Calculus.

1. RA
       πeid(σaname='Boeing'(Aircraft ⋈ Certified))

   TRC
       {C.eid | C ∈ Certified ∧ ∃A ∈ Aircraft(A.aid = C.aid
       ∧ A.aname = 'Boeing')}

   DRC
       {⟨Ceid⟩ | ⟨Ceid, Caid⟩ ∈ Certified ∧ ∃Aid, AN, AR
       (⟨Aid, AN, AR⟩ ∈ Aircraft ∧ Aid = Caid ∧ AN = 'Boeing')}
Relational Algebra and Calculus
   SQL
       SELECT C.eid
       FROM Aircraft A, Certified C
       WHERE A.aid = C.aid AND A.aname = 'Boeing'

2. RA
       πename(σaname='Boeing'(Aircraft ⋈ Certified ⋈ Employees))

   TRC
       {E.ename | E ∈ Employees ∧ ∃C ∈ Certified(∃A ∈ Aircraft
       (A.aid = C.aid ∧ A.aname = 'Boeing' ∧ E.eid = C.eid))}

   DRC
       {⟨EN⟩ | ⟨Eid, EN, ES⟩ ∈ Employees ∧ ∃Ceid, Caid
       (⟨Ceid, Caid⟩ ∈ Certified ∧ ∃Aid, AN, AR(⟨Aid, AN, AR⟩ ∈ Aircraft
       ∧ Aid = Caid ∧ AN = 'Boeing' ∧ Eid = Ceid))}

   SQL
       SELECT E.ename
       FROM Aircraft A, Certified C, Employees E
       WHERE A.aid = C.aid AND A.aname = 'Boeing' AND E.eid = C.eid

3. RA
       ρ(BonnToMadras, σfrom='Bonn' ∧ to='Madras'(Flights))
       πaid(σcruisingrange>distance(Aircraft × BonnToMadras))

   TRC
       {A.aid | A ∈ Aircraft ∧ ∃F ∈ Flights(F.from = 'Bonn'
       ∧ F.to = 'Madras' ∧ A.cruisingrange > F.distance)}

   DRC
       {⟨Aid⟩ | ⟨Aid, AN, AR⟩ ∈ Aircraft ∧ ∃FN, FF, FT, FDi, FDe, FA
       (⟨FN, FF, FT, FDi, FDe, FA⟩ ∈ Flights ∧ FF = 'Bonn'
       ∧ FT = 'Madras' ∧ FDi < AR)}
   SQL
       SELECT A.aid
       FROM Aircraft A, Flights F
       WHERE F.from = 'Bonn' AND F.to = 'Madras'
             AND A.cruisingrange > F.distance

4. RA
       πflno(σdistance<cruisingrange ∧ salary>100,000
       (Flights ⋈ Aircraft ⋈ Certified ⋈ Employees))

   TRC
       {F.flno | F ∈ Flights ∧ ∃A ∈ Aircraft ∃C ∈ Certified
       ∃E ∈ Employees(A.cruisingrange > F.distance ∧ E.salary > 100,000
       ∧ A.aid = C.aid ∧ E.eid = C.eid)}

   DRC
       {⟨FN⟩ | ⟨FN, FF, FT, FDi, FDe, FA⟩ ∈ Flights
       ∧ ∃Ceid, Caid(⟨Ceid, Caid⟩ ∈ Certified
       ∧ ∃Aid, AN, AR(⟨Aid, AN, AR⟩ ∈ Aircraft
       ∧ ∃Eid, EN, ES(⟨Eid, EN, ES⟩ ∈ Employees
       ∧ AR > FDi ∧ ES > 100,000 ∧ Aid = Caid ∧ Eid = Ceid)))}

   SQL
       SELECT F.flno
       FROM Aircraft A, Certified C, Employees E, Flights F
       WHERE A.aid = C.aid AND E.eid = C.eid
             AND distance < cruisingrange AND salary > 100,000

5. RA
       ρ(R1, πeid(σcruisingrange>3000(Aircraft ⋈ Certified)))
       πename(Employees ⋈ (R1 − πeid(σaname='Boeing'(Aircraft ⋈ Certified))))

   TRC
       {E.ename | E ∈ Employees ∧ ∃C ∈ Certified(∃A ∈ Aircraft
       (A.aid = C.aid ∧ E.eid = C.eid ∧ A.cruisingrange > 3000))
       ∧ ¬(∃C2 ∈ Certified(∃A2 ∈ Aircraft(A2.aname = 'Boeing'
       ∧ C2.aid = A2.aid ∧ C2.eid = E.eid)))}

   DRC
       {⟨EN⟩ | ⟨Eid, EN, ES⟩ ∈ Employees
       ∧ ∃Ceid, Caid(⟨Ceid, Caid⟩ ∈ Certified
       ∧ ∃Aid, AN, AR(⟨Aid, AN, AR⟩ ∈ Aircraft
       ∧ Aid = Caid ∧ Eid = Ceid ∧ AR > 3000))
       ∧ ¬(∃Aid2, AN2, AR2(⟨Aid2, AN2, AR2⟩ ∈ Aircraft
       ∧ ∃Ceid2, Caid2(⟨Ceid2, Caid2⟩ ∈ Certified
       ∧ Aid2 = Caid2 ∧ Eid = Ceid2 ∧ AN2 = 'Boeing')))}

   SQL
       SELECT E.ename
       FROM Certified C, Employees E, Aircraft A
       WHERE A.aid = C.aid AND E.eid = C.eid AND A.cruisingrange > 3000
             AND E.eid NOT IN ( SELECT C2.eid
                                FROM Certified C2, Aircraft A2
                                WHERE C2.aid = A2.aid
                                      AND A2.aname = 'Boeing' )

6.
   RA  The approach to take is to first find all the employees who do not have
   the highest salary. Subtract these from the original list of employees and
   what is left is the highest paid employees.

       ρ(E1, Employees)
       ρ(E2, Employees)
       ρ(E3, πE2.eid(E1 ⋈E1.salary>E2.salary E2))
       (πeid E1) − E3

   TRC
       {E1.eid | E1 ∈ Employees ∧ ¬(∃E2 ∈ Employees(E2.salary > E1.salary))}

   DRC
       {⟨Eid1⟩ | ⟨Eid1, EN1, ES1⟩ ∈ Employees
       ∧ ¬(∃Eid2, EN2, ES2(⟨Eid2, EN2, ES2⟩ ∈ Employees ∧ ES2 > ES1))}

   SQL
       SELECT E.eid
       FROM Employees E
       WHERE E.salary = ( SELECT MAX (E2.salary)
                          FROM Employees E2 )

7. RA  The approach taken is similar to the solution for the previous
   exercise. First find all the employees who do not have the highest salary.
   Remove these from the original list of employees and what is left is the
   highest paid employees. Remove the highest paid employees from the original
   list. What is left is the second highest paid employees together with the
   rest of the employees. Then find the highest paid employees of this new
   list. This is the list of the second highest paid employees.

       ρ(E1, Employees)
       ρ(E2, Employees)
       ρ(E3, πE2.eid(E1 ⋈E1.salary>E2.salary E2))
       ρ(E4, E2 ⋈ E3)
       ρ(E5, E2 ⋈ E3)
       ρ(E6, πE5.eid(E4 ⋈E4.salary>E5.salary E5))
       (πeid E3) − E6

   TRC
       {E1.eid | E1 ∈ Employees ∧ ∃E2 ∈ Employees(E2.salary > E1.salary
       ∧ ¬(∃E3 ∈ Employees(E3.salary > E2.salary)))}

   DRC
       {⟨Eid1⟩ | ⟨Eid1, EN1, ES1⟩ ∈ Employees
       ∧ ∃Eid2, EN2, ES2(⟨Eid2, EN2, ES2⟩ ∈ Employees ∧ ES2 > ES1
       ∧ ¬(∃Eid3, EN3, ES3(⟨Eid3, EN3, ES3⟩ ∈ Employees ∧ ES3 > ES2)))}

   SQL
       SELECT E.eid
       FROM Employees E
       WHERE E.salary = (SELECT MAX (E2.salary)
                         FROM Employees E2
                         WHERE E2.salary <> (SELECT MAX (E3.salary)
                                             FROM Employees E3 ))
8. This cannot be expressed in relational algebra (or calculus) because there
   is no operator to count, and this query requires the ability to count up to
   a number that depends on the data. The query can however be expressed in
   SQL as follows:

       SELECT Temp.eid
       FROM ( SELECT C.eid AS eid, COUNT (C.aid) AS cnt
              FROM Certified C
              GROUP BY C.eid ) AS Temp
       WHERE Temp.cnt = ( SELECT MAX (Temp2.cnt)
                          FROM ( SELECT C2.eid AS eid, COUNT (C2.aid) AS cnt
                                 FROM Certified C2
                                 GROUP BY C2.eid ) AS Temp2 )
9. RA  The approach behind this query is to first find the employees who are
   certified for at least three aircraft (they appear at least three times in
   the Certified relation). Then find the employees who are certified for at
   least four aircraft. Subtract the second from the first and what is left is
   the employees who are certified for exactly three aircraft.

       ρ(R1, Certified)
       ρ(R2, Certified)
       ρ(R3, Certified)
       ρ(R4, Certified)
       ρ(R5, πeid(σ(R1.eid=R2.eid=R3.eid) ∧ (R1.aid≠R2.aid≠R3.aid)(R1 × R2 × R3)))
       ρ(R6, πeid(σ(R1.eid=R2.eid=R3.eid=R4.eid) ∧ (R1.aid≠R2.aid≠R3.aid≠R4.aid)
              (R1 × R2 × R3 × R4)))
       R5 − R6

   TRC
       {C1.eid | C1 ∈ Certified ∧ ∃C2 ∈ Certified(∃C3 ∈ Certified
       (C1.eid = C2.eid ∧ C2.eid = C3.eid
       ∧ C1.aid ≠ C2.aid ∧ C2.aid ≠ C3.aid ∧ C3.aid ≠ C1.aid
       ∧ ¬(∃C4 ∈ Certified(C3.eid = C4.eid ∧ C1.aid ≠ C4.aid
       ∧ C2.aid ≠ C4.aid ∧ C3.aid ≠ C4.aid))))}

   DRC
       {⟨CE1⟩ | ⟨CE1, CA1⟩ ∈ Certified ∧ ∃CE2, CA2(⟨CE2, CA2⟩ ∈ Certified
       ∧ ∃CE3, CA3(⟨CE3, CA3⟩ ∈ Certified ∧ (CE1 = CE2 ∧ CE2 = CE3
       ∧ CA1 ≠ CA2 ∧ CA2 ≠ CA3 ∧ CA3 ≠ CA1
       ∧ ¬(∃CE4, CA4(⟨CE4, CA4⟩ ∈ Certified ∧ (CE3 = CE4 ∧ CA1 ≠ CA4
       ∧ CA2 ≠ CA4 ∧ CA3 ≠ CA4))))))}

   SQL
       SELECT C1.eid
       FROM Certified C1, Certified C2, Certified C3
       WHERE (C1.eid = C2.eid AND C2.eid = C3.eid
              AND C1.aid <> C2.aid AND C2.aid <> C3.aid
              AND C3.aid <> C1.aid)
       EXCEPT
       SELECT C4.eid
       FROM Certified C4, Certified C5, Certified C6, Certified C7
       WHERE (C4.eid = C5.eid AND C5.eid = C6.eid AND C6.eid = C7.eid
              AND C4.aid <> C5.aid AND C4.aid <> C6.aid AND C4.aid <> C7.aid
              AND C5.aid <> C6.aid AND C5.aid <> C7.aid AND C6.aid <> C7.aid)
   This could also be done in SQL using COUNT.

10. This cannot be expressed in relational algebra (or calculus) because there
    is no operator to sum values. The query can however be expressed in SQL as
    follows:

        SELECT SUM (E.salary)
        FROM Employees E

11. This cannot be expressed in relational algebra, relational calculus, or
    SQL. The problem is that there is no restriction on the number of
    intermediate flights. All of the query methods could find out whether there
    was a flight directly from Madison to Timbuktu, and whether there was a
    sequence of two flights that started in Madison and ended in Timbuktu. They
    could even find a sequence of n flights that started in Madison and ended
    in Timbuktu, as long as there is a static (i.e., data-independent) upper
    bound on the number of intermediate flights. (For large n, this would of
    course be long and impractical, but at least possible.) In this query,
    however, the upper bound is not static but dynamic (based upon the set of
    tuples in the Flights relation). In summary, if we had a static upper bound
    (say k), we could write an algebra or SQL query that repeatedly computes
    (up to k) joins on the Flights relation. If the upper bound is dynamic,
    then we cannot write such a query because k is not known when writing the
    query.

12. This cannot be expressed in relational algebra (or calculus). If we had the
    constraint that every employee has a unique salary, then the query could be
    expressed with much difficulty in relational algebra and calculus. To do
    this, the 20 highest paid employees would be removed from the list one by
    one and saved, as in questions 6 and 7. Then the saved tuples could be
    added together with union. In practice, this is tedious. In SQL, however,
    ORDER BY could be used to list the employees in sorted order by salary, and
    the user could see the top 20 rows. (Of course, this could be done in SQL
    even without the constraint mentioned above.)
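The limitation in answer 11 applies to the basic SQL of this chapter; SQL:1999 recursive queries (WITH RECURSIVE, supported by SQLite and most modern systems) do compute this kind of unbounded reachability. A sketch using Python's sqlite3, with invented intermediate cities:

```python
import sqlite3

# Invented Flights instance: Madison reaches Timbuktu only via three hops,
# so no fixed number of joins chosen in advance would work in general.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Flights (flno INTEGER, "from" TEXT, "to" TEXT)')
conn.executemany("INSERT INTO Flights VALUES (?, ?, ?)", [
    (1, "Madison", "Chicago"),
    (2, "Chicago", "Paris"),
    (3, "Paris", "Timbuktu"),
])

# Transitive closure of the flight graph, seeded at Madison.
reachable = [r[0] for r in conn.execute("""
    WITH RECURSIVE Reach(city) AS (
        SELECT 'Madison'
        UNION
        SELECT F."to"
        FROM Flights F JOIN Reach R ON F."from" = R.city)
    SELECT city FROM Reach WHERE city = 'Timbuktu'""")]
exists_route = bool(reachable)
```

This does not contradict the answer: recursion is strictly more expressive than the relational algebra and calculus discussed in this chapter.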
Exercise 4.6 What is relational completeness? If a query language is relationally complete, can you write any desired query in that language? Answer 4.6 Answer omitted.
Exercise 4.7 What is an unsafe query? Give an example and explain why it is important to disallow such queries.
Answer 4.7 An unsafe query is a query in relational calculus that has an infinite number of results. An example of such a query is:

    {S | ¬(S ∈ Sailors)}

The query asks for all things that are not sailors, which of course is everything else. Clearly there is an infinite number of answers, and this query is unsafe. It is important to disallow unsafe queries because we want to be able to get back to users with a list of all the answers to a query after a finite amount of time.
5
SQL: QUERIES, PROGRAMMING, TRIGGERS
Exercise 5.1 Consider the following relations:

Student(snum: integer, sname: string, major: string, level: string, age: integer)
Class(name: string, meets_at: time, room: string, fid: integer)
Enrolled(snum: integer, cname: string)
Faculty(fid: integer, fname: string, deptid: integer)

The meaning of these relations is straightforward; for example, Enrolled has one record per student-class pair such that the student is enrolled in the class. Write the following queries in SQL. No duplicates should be printed in any of the answers.

1. Find the names of all Juniors (Level = JR) who are enrolled in a class taught by I. Teach.

2. Find the age of the oldest student who is either a History major or is enrolled in a course taught by I. Teach.

3. Find the names of all classes that either meet in room R128 or have five or more students enrolled.

4. Find the names of all students who are enrolled in two classes that meet at the same time.

5. Find the names of faculty members who teach in every room in which some class is taught.

6. Find the names of faculty members for whom the combined enrollment of the courses that they teach is less than five.

7. Print the Level and the average age of students for that Level, for each Level.
8. Print the Level and the average age of students for that Level, for all Levels except JR.

9. Find the names of students who are enrolled in the maximum number of classes.

10. Find the names of students who are not enrolled in any class.

11. For each age value that appears in Students, find the level value that appears most often. For example, if there are more FR level students aged 18 than SR, JR, or SO students aged 18, you should print the pair (18, FR).

Answer 5.1 The answers are given below:

1. SELECT DISTINCT S.sname
   FROM Student S, Class C, Enrolled E, Faculty F
   WHERE S.snum = E.snum AND E.cname = C.name AND C.fid = F.fid
         AND F.fname = 'I.Teach' AND S.level = 'JR'

2. SELECT MAX(S.age)
   FROM Student S
   WHERE (S.major = 'History')
   OR S.snum IN (SELECT E.snum
                 FROM Class C, Enrolled E, Faculty F
                 WHERE E.cname = C.name AND C.fid = F.fid
                       AND F.fname = 'I.Teach')

3. SELECT C.name
   FROM Class C
   WHERE C.room = 'R128'
   OR C.name IN (SELECT E.cname
                 FROM Enrolled E
                 GROUP BY E.cname
                 HAVING COUNT (*) >= 5)

4. SELECT DISTINCT S.sname
   FROM Student S
   WHERE S.snum IN (SELECT E1.snum
                    FROM Enrolled E1, Enrolled E2, Class C1, Class C2
                    WHERE E1.snum = E2.snum AND E1.cname <> E2.cname
                          AND E1.cname = C1.name AND E2.cname = C2.name
                          AND C1.meets_at = C2.meets_at)

5. SELECT DISTINCT F.fname
   FROM Faculty F
   WHERE NOT EXISTS (( SELECT C.room
                       FROM Class C )
                     EXCEPT
                     ( SELECT C1.room
                       FROM Class C1
                       WHERE C1.fid = F.fid ))

6. SELECT DISTINCT F.fname
   FROM Faculty F
   WHERE 5 > (SELECT COUNT (E.snum)
              FROM Class C, Enrolled E
              WHERE C.name = E.cname AND C.fid = F.fid)

7. SELECT S.level, AVG(S.age)
   FROM Student S
   GROUP BY S.level

8. SELECT S.level, AVG(S.age)
   FROM Student S
   WHERE S.level <> 'JR'
   GROUP BY S.level

9. SELECT DISTINCT S.sname
   FROM Student S
   WHERE S.snum IN (SELECT E.snum
                    FROM Enrolled E
                    GROUP BY E.snum
                    HAVING COUNT (*) >= ALL (SELECT COUNT (*)
                                             FROM Enrolled E2
                                             GROUP BY E2.snum))

10. SELECT DISTINCT S.sname
    FROM Student S
    WHERE S.snum NOT IN (SELECT E.snum
                         FROM Enrolled E)
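Queries 7 and 8 differ only in a WHERE clause, which filters rows before grouping. A minimal sketch with Python's sqlite3 on invented Student rows to make the difference visible:

```python
import sqlite3

# Invented toy Student instance: two JR students and one SO student.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (snum INTEGER, sname TEXT, major TEXT, level TEXT, age INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?, ?)", [
    (1, "Ann", "CS", "JR", 20),
    (2, "Bob", "CS", "JR", 22),
    (3, "Cat", "EE", "SO", 19),
])

# Query 7: average age per level, every level.
all_levels = conn.execute(
    "SELECT S.level, AVG(S.age) FROM Student S "
    "GROUP BY S.level ORDER BY S.level").fetchall()

# Query 8: the WHERE filter removes JR rows *before* the grouping happens.
no_jr = conn.execute(
    "SELECT S.level, AVG(S.age) FROM Student S "
    "WHERE S.level <> 'JR' GROUP BY S.level").fetchall()
```

Note that query 9's `>= ALL` subquery is standard SQL but not accepted by every engine (SQLite, for example, lacks ALL comparisons), which is why the simpler aggregate queries are used here.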
Exercise 5.2 Consider the following schema:

Suppliers(sid: integer, sname: string, address: string)
Parts(pid: integer, pname: string, color: string)
Catalog(sid: integer, pid: integer, cost: real)

The Catalog relation lists the prices charged for parts by Suppliers. Write the following queries in SQL:
1. Find the pnames of parts for which there is some supplier.

2. Find the snames of suppliers who supply every part.

3. Find the snames of suppliers who supply every red part.

4. Find the pnames of parts supplied by Acme Widget Suppliers and by no one else.

5. Find the sids of suppliers who charge more for some part than the average cost of that part (averaged over all the suppliers who supply that part).

6. For each part, find the sname of the supplier who charges the most for that part.

7. Find the sids of suppliers who supply only red parts.

8. Find the sids of suppliers who supply a red part and a green part.

9. Find the sids of suppliers who supply a red part or a green part.

Answer 5.2 Answer omitted.
Exercise 5.3 The following relations keep track of airline flight information:

Flights(flno: integer, from: string, to: string, distance: integer, departs: time, arrives: time, price: integer)
Aircraft(aid: integer, aname: string, cruisingrange: integer)
Certified(eid: integer, aid: integer)
Employees(eid: integer, ename: string, salary: integer)

Note that the Employees relation describes pilots and other kinds of employees as well; every pilot is certified for some aircraft, and only pilots are certified to fly. Write each of the following queries in SQL. (Additional queries using the same schema are listed in the exercises for Chapter 4.)

1. Find the names of aircraft such that all pilots certified to operate them earn more than 80,000.

2. For each pilot who is certified for more than three aircraft, find the eid and the maximum cruisingrange of the aircraft that he (or she) is certified for.

3. Find the names of pilots whose salary is less than the price of the cheapest route from Los Angeles to Honolulu.

4. For all aircraft with cruisingrange over 1,000 miles, find the name of the aircraft and the average salary of all pilots certified for this aircraft.

5. Find the names of pilots certified for some Boeing aircraft.
6. Find the aids of all aircraft that can be used on routes from Los Angeles to Chicago.
7. Identify the routes that can be piloted by every pilot who makes more than $100,000.
8. Print the enames of pilots who can operate planes with cruisingrange greater than 3,000 miles, but are not certified on any Boeing aircraft.
9. A customer wants to travel from Madison to New York with no more than two changes of flight. List the choice of departure times from Madison if the customer wants to arrive in New York by 6 p.m.
10. Compute the difference between the average salary of a pilot and the average salary of all employees (including pilots).
11. Print the name and salary of every nonpilot whose salary is more than the average salary for pilots.

Answer 5.3 The answers are given below:

1. SELECT DISTINCT A.aname
   FROM   Aircraft A
   WHERE  A.aid IN ( SELECT C.aid
                     FROM   Certified C, Employees E
                     WHERE  C.eid = E.eid
                            AND NOT EXISTS ( SELECT *
                                             FROM   Employees E1
                                             WHERE  E1.eid = E.eid AND E1.salary < 80000 ) )

2. SELECT   C.eid, MAX (A.cruisingrange)
   FROM     Certified C, Aircraft A
   WHERE    C.aid = A.aid
   GROUP BY C.eid
   HAVING   COUNT (*) > 3

3. SELECT DISTINCT E.ename
   FROM   Employees E
   WHERE  E.salary < ( SELECT MIN (F.price)
                       FROM   Flights F
                       WHERE  F.from = 'LA' AND F.to = 'Honolulu' )
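The GROUP BY/HAVING pattern of query 2 can be checked against a small instance. Below is a sketch using SQLite through Python's built-in sqlite3 module; the aircraft and certification rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Aircraft (aid INTEGER, aname TEXT, cruisingrange INTEGER)")
cur.execute("CREATE TABLE Certified (eid INTEGER, aid INTEGER)")
# Hypothetical data: pilot 1 is certified for four aircraft, pilot 2 for one.
cur.executemany("INSERT INTO Aircraft VALUES (?, ?, ?)",
                [(1, "A310", 5000), (2, "A320", 3000),
                 (3, "B737", 4000), (4, "B747", 8000)])
cur.executemany("INSERT INTO Certified VALUES (?, ?)",
                [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1)])
# Query 2: eid and maximum cruisingrange for pilots certified for
# more than three aircraft.
cur.execute("""
    SELECT   C.eid, MAX(A.cruisingrange)
    FROM     Certified C, Aircraft A
    WHERE    C.aid = A.aid
    GROUP BY C.eid
    HAVING   COUNT(*) > 3
""")
print(cur.fetchall())  # [(1, 8000)]: only pilot 1 qualifies
```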
4. Observe that aid is the key for Aircraft, but the question asks for aircraft names; we deal with this complication by using an intermediate relation Temp:
SELECT Temp.name, Temp.AvgSalary
FROM ( SELECT   A.aid, A.aname AS name, AVG (E.salary) AS AvgSalary
       FROM     Aircraft A, Certified C, Employees E
       WHERE    A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000
       GROUP BY A.aid, A.aname ) AS Temp

5. SELECT DISTINCT E.ename
   FROM   Employees E, Certified C, Aircraft A
   WHERE  E.eid = C.eid AND C.aid = A.aid AND A.aname = 'Boeing'
6. SELECT A.aid
   FROM   Aircraft A
   WHERE  A.cruisingrange > ( SELECT MIN (F.distance)
                              FROM   Flights F
                              WHERE  F.from = 'L.A.' AND F.to = 'Chicago' )

7. SELECT DISTINCT F.from, F.to
   FROM   Flights F
   WHERE  NOT EXISTS ( SELECT *
                       FROM   Employees E
                       WHERE  E.salary > 100000
                              AND NOT EXISTS ( SELECT *
                                               FROM   Aircraft A, Certified C
                                               WHERE  A.cruisingrange > F.distance
                                                      AND E.eid = C.eid
                                                      AND A.aid = C.aid ) )

8. SELECT DISTINCT E.ename
   FROM   Employees E, Certified C, Aircraft A
   WHERE  C.eid = E.eid AND C.aid = A.aid AND A.cruisingrange > 3000
          AND E.eid NOT IN ( SELECT C1.eid
                             FROM   Certified C1, Aircraft A1
                             WHERE  C1.aid = A1.aid AND A1.aname = 'Boeing' )
9. SELECT F.departs
   FROM   Flights F
   WHERE  F.flno IN (
          ( SELECT F0.flno
            FROM   Flights F0
            WHERE  F0.from = 'Madison' AND F0.to = 'NY'
                   AND F0.arrives < 1800 )
          UNION
          ( SELECT F0.flno
            FROM   Flights F0, Flights F1
            WHERE  F0.from = 'Madison' AND F0.to <> 'NY'
                   AND F0.to = F1.from AND F1.to = 'NY'
                   AND F1.departs > F0.arrives AND F1.arrives < 1800 )
          UNION
          ( SELECT F0.flno
            FROM   Flights F0, Flights F1, Flights F2
            WHERE  F0.from = 'Madison'
                   AND F0.to = F1.from AND F1.to = F2.from AND F2.to = 'NY'
                   AND F0.to <> 'NY' AND F1.to <> 'NY'
                   AND F1.departs > F0.arrives AND F2.departs > F1.arrives
                   AND F2.arrives < 1800 ) )

10. SELECT Temp1.avg - Temp2.avg
    FROM ( SELECT AVG (E.salary) AS avg
           FROM   Employees E
           WHERE  E.eid IN ( SELECT DISTINCT C.eid
                             FROM   Certified C ) ) AS Temp1,
         ( SELECT AVG (E1.salary) AS avg
           FROM   Employees E1 ) AS Temp2

11. SELECT E.ename, E.salary
    FROM   Employees E
    WHERE  E.eid NOT IN ( SELECT DISTINCT C.eid
                          FROM   Certified C )
           AND E.salary > ( SELECT AVG (E1.salary)
                            FROM   Employees E1
                            WHERE  E1.eid IN ( SELECT DISTINCT C1.eid
                                               FROM   Certified C1 ) )
Exercise 5.4 Consider the following relational schema. An employee can work in more than one department; the pct time ﬁeld of the Works relation shows the percentage of time that a given employee works in a given department. Emp(eid: integer, ename: string, age: integer, salary: real) Works(eid: integer, did: integer, pct time: integer) Dept(did: integer, budget: real, managerid: integer) Write the following queries in SQL: 1. Print the names and ages of each employee who works in both the Hardware department and the Software department. 2. For each department with more than 20 full-time-equivalent employees (i.e., where the part-time and full-time employees add up to at least that many full-time employees), print the did together with the number of employees that work in that department. 3. Print the name of each employee whose salary exceeds the budget of all of the departments that he or she works in. 4. Find the managerids of managers who manage only departments with budgets greater than $1,000,000. 5. Find the enames of managers who manage the departments with the largest budget. 6. If a manager manages more than one department, he or she controls the sum of all the budgets for those departments. Find the managerids of managers who control more than $5,000,000. 7. Find the managerids of managers who control the largest amount. Answer 5.4 Answer omitted.
Exercise 5.5 Consider the instance of the Sailors relation shown in Figure 5.1. 1. Write SQL queries to compute the average rating, using AVG; the sum of the ratings, using SUM; and the number of ratings, using COUNT. 2. If you divide the sum computed above by the count, would the result be the same as the average? How would your answer change if the above steps were carried out with respect to the age ﬁeld instead of rating?
sid | sname | rating | age
----+-------+--------+-----
18  | jones | 3      | 30.0
41  | jonah | 6      | 56.0
22  | ahab  | 7      | 44.0
63  | moby  | null   | 15.0

Figure 5.1   An Instance of Sailors
3. Consider the following query: Find the names of sailors with a higher rating than all sailors with age < 21. The following two SQL queries attempt to obtain the answer to this question. Do they both compute the result? If not, explain why. Under what conditions would they compute the same result?

SELECT S.sname
FROM   Sailors S
WHERE  NOT EXISTS ( SELECT *
                    FROM   Sailors S2
                    WHERE  S2.age < 21 AND S.rating <= S2.rating )

SELECT *
FROM   Sailors S
WHERE  S.rating > ANY ( SELECT S2.rating
                        FROM   Sailors S2
                        WHERE  S2.age < 21 )

4. Consider the instance of Sailors shown in Figure 5.1. Let us define instance S1 of Sailors to consist of the first two tuples, instance S2 to be the last two tuples, and S to be the given instance.
(a) Show the left outer join of S with itself, with the join condition being sid=sid.
(b) Show the right outer join of S with itself, with the join condition being sid=sid.
(c) Show the full outer join of S with itself, with the join condition being sid=sid.
(d) Show the left outer join of S1 with S2, with the join condition being sid=sid.
(e) Show the right outer join of S1 with S2, with the join condition being sid=sid.
(f) Show the full outer join of S1 with S2, with the join condition being sid=sid.

Answer 5.5 The answers are shown below:
4. (a)
sid | sname | rating | age  | sid | sname | rating | age
18  | jones | 3      | 30.0 | 18  | jones | 3      | 30.0
41  | jonah | 6      | 56.0 | 41  | jonah | 6      | 56.0
22  | ahab  | 7      | 44.0 | 22  | ahab  | 7      | 44.0
63  | moby  | null   | 15.0 | 63  | moby  | null   | 15.0
1. SELECT AVG (S.rating) AS AVERAGE
   FROM   Sailors S

   SELECT SUM (S.rating)
   FROM   Sailors S

   SELECT COUNT (S.rating)
   FROM   Sailors S
2. The result of dividing the sum by the count would be smaller than the result using AVG if there are tuples with rating = NULL and the count is computed with COUNT (*). This is because all the aggregate operators except COUNT (*) ignore NULL values. So AVG is computed over only the tuples with non-NULL rating values, while dividing SUM by COUNT (*) spreads the same sum over all tuples. (If COUNT (S.rating) is used instead, it also ignores NULLs, and the two results agree.) However, if the aggregation is done on the age field, the answers using both approaches would be the same, since the age field takes no NULL values.

3. Only the first query is correct. The second query returns the names of sailors with a higher rating than at least one sailor with age < 21. Note that the answer to the second query does not necessarily contain the answer to the first query. In particular, if all the sailors are at least 21 years old, the second query returns an empty set while the first query returns all the sailors. This is because the NOT EXISTS predicate in the first query evaluates to true if its subquery evaluates to an empty set, while the ANY predicate in the second query evaluates to false if its subquery evaluates to an empty set. The two queries give the same results if and only if one of the following two conditions holds: the Sailors relation is empty; or there is at least one sailor with age < 21 in the Sailors relation, and for every sailor s, either s has a higher rating than all sailors under 21 or s has a rating no higher than that of any sailor under 21.
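The effect of the NULL rating on the aggregates can be observed directly by running the Figure 5.1 instance through SQLite (a sketch; sqlite3 is Python's built-in SQL engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
cur.executemany("INSERT INTO Sailors VALUES (?, ?, ?, ?)",
                [(18, "jones", 3, 30.0), (41, "jonah", 6, 56.0),
                 (22, "ahab", 7, 44.0), (63, "moby", None, 15.0)])
# AVG, SUM, and COUNT(rating) all skip the NULL rating of sailor 63 ...
avg, total, count = cur.execute(
    "SELECT AVG(rating), SUM(rating), COUNT(rating) FROM Sailors").fetchone()
print(avg, total, count)   # 16/3 over the 3 non-NULL ratings, 16, 3
# ... but COUNT(*) counts all four tuples, so SUM/COUNT(*) != AVG.
n = cur.execute("SELECT COUNT(*) FROM Sailors").fetchone()[0]
print(total / n)           # 4.0, smaller than AVG(rating)
```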
Exercise 5.6 Answer the following questions. 1. Explain the term impedance mismatch in the context of embedding SQL commands in a host language such as C.
(b)
sid | sname | rating | age  | sid | sname | rating | age
18  | jones | 3      | 30.0 | 18  | jones | 3      | 30.0
41  | jonah | 6      | 56.0 | 41  | jonah | 6      | 56.0
22  | ahab  | 7      | 44.0 | 22  | ahab  | 7      | 44.0
63  | moby  | null   | 15.0 | 63  | moby  | null   | 15.0

(c)
sid | sname | rating | age  | sid | sname | rating | age
18  | jones | 3      | 30.0 | 18  | jones | 3      | 30.0
41  | jonah | 6      | 56.0 | 41  | jonah | 6      | 56.0
22  | ahab  | 7      | 44.0 | 22  | ahab  | 7      | 44.0
63  | moby  | null   | 15.0 | 63  | moby  | null   | 15.0
(d)
sid  | sname | rating | age  | sid  | sname | rating | age
18   | jones | 3      | 30.0 | null | null  | null   | null
41   | jonah | 6      | 56.0 | null | null  | null   | null

(e)
sid  | sname | rating | age  | sid  | sname | rating | age
null | null  | null   | null | 22   | ahab  | 7      | 44.0
null | null  | null   | null | 63   | moby  | null   | 15.0

(f)
sid  | sname | rating | age  | sid  | sname | rating | age
18   | jones | 3      | 30.0 | null | null  | null   | null
41   | jonah | 6      | 56.0 | null | null  | null   | null
null | null  | null   | null | 22   | ahab  | 7      | 44.0
null | null  | null   | null | 63   | moby  | null   | 15.0
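Part (d) can be reproduced mechanically. The sketch below builds S1 and S2 in SQLite and runs the left outer join; SQLite reports SQL NULL back to Python as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE S1 (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
cur.execute("CREATE TABLE S2 (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
cur.executemany("INSERT INTO S1 VALUES (?, ?, ?, ?)",
                [(18, "jones", 3, 30.0), (41, "jonah", 6, 56.0)])
cur.executemany("INSERT INTO S2 VALUES (?, ?, ?, ?)",
                [(22, "ahab", 7, 44.0), (63, "moby", None, 15.0)])
# No sid in S1 matches any sid in S2, so every S1 tuple is padded
# with NULLs on the right, exactly as in table (d).
rows = cur.execute("SELECT * FROM S1 LEFT OUTER JOIN S2 ON S1.sid = S2.sid").fetchall()
for row in rows:
    print(row)
# (18, 'jones', 3, 30.0, None, None, None, None)
# (41, 'jonah', 6, 56.0, None, None, None, None)
```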
2. How can the value of a host language variable be passed to an embedded SQL command? 3. Explain the WHENEVER command’s use in error and exception handling. 4. Explain the need for cursors. 5. Give an example of a situation that calls for the use of embedded SQL, that is, interactive use of SQL commands is not enough, and some host language capabilities are needed. 6. Write a C program with embedded SQL commands to address your example in the previous answer. 7. Write a C program with embedded SQL commands to ﬁnd the standard deviation of sailors’ ages. 8. Extend the previous program to ﬁnd all sailors whose age is within one standard deviation of the average age of all sailors. 9. Explain how you would write a C program to compute the transitive closure of a graph, represented as an SQL relation Edges(from, to), using embedded SQL commands. (You don’t have to write the program; just explain the main points to be dealt with.) 10. Explain the following terms with respect to cursors: updatability, sensitivity, and scrollability. 11. Deﬁne a cursor on the Sailors relation that is updatable, scrollable, and returns answers sorted by age. Which ﬁelds of Sailors can such a cursor not update? Why? 12. Give an example of a situation that calls for dynamic SQL, that is, even embedded SQL is not suﬃcient. Answer 5.6 Answer omitted.
Exercise 5.7 Consider the following relational schema and brieﬂy answer the questions that follow: Emp(eid: integer, ename: string, age: integer, salary: real) Works(eid: integer, did: integer, pct time: integer) Dept(did: integer, budget: real, managerid: integer)
1. Deﬁne a table constraint on Emp that will ensure that every employee makes at least $10,000.
2. Define a table constraint on Dept that will ensure that all managers have age > 30.
3. Define an assertion on Dept that will ensure that all managers have age > 30. Compare this assertion with the equivalent table constraint. Explain which is better.
4. Write SQL statements to delete all information about employees whose salaries exceed that of the manager of one or more departments that they work in. Be sure to ensure that all the relevant integrity constraints are satisfied after your updates.

Answer 5.7 The answers are given below:

1. Define a table constraint on Emp that will ensure that every employee makes at least $10,000:

CREATE TABLE Emp ( eid INTEGER,
                   ename CHAR(10),
                   age INTEGER,
                   salary REAL,
                   PRIMARY KEY (eid),
                   CHECK ( salary >= 10000 ) )

2. Define a table constraint on Dept that will ensure that all managers have age > 30:

CREATE TABLE Dept ( did INTEGER,
                    budget REAL,
                    managerid INTEGER,
                    PRIMARY KEY (did),
                    FOREIGN KEY (managerid) REFERENCES Emp,
                    CHECK ( ( SELECT E.age
                              FROM   Emp E, Dept D
                              WHERE  E.eid = D.managerid ) > 30 ) )

3. Define an assertion on Dept that will ensure that all managers have age > 30:

CREATE TABLE Dept ( did INTEGER,
                    budget REAL,
                    managerid INTEGER,
                    PRIMARY KEY (did) )
CREATE ASSERTION managerAge
CHECK ( ( SELECT E.age
          FROM   Emp E, Dept D
          WHERE  E.eid = D.managerid ) > 30 )
Since the constraint involves two relations, it is better to define it as an assertion, independent of any one relation, than as a check condition on the Dept relation. The limitation of the latter approach is that the condition is checked only when the Dept relation is updated. However, since age is an attribute of the Emp relation, it is possible to update the age of a manager in a way that violates the constraint. So the former approach is better, since it checks for potential violation of the assertion whenever one of the relations is updated.

4. To write such statements, it is necessary to consider the constraints defined over the tables. We will assume the following:

CREATE TABLE Emp ( eid INTEGER,
                   ename CHAR(10),
                   age INTEGER,
                   salary REAL,
                   PRIMARY KEY (eid) )

CREATE TABLE Works ( eid INTEGER,
                     did INTEGER,
                     pcttime INTEGER,
                     PRIMARY KEY (eid, did),
                     FOREIGN KEY (did) REFERENCES Dept,
                     FOREIGN KEY (eid) REFERENCES Emp ON DELETE CASCADE )

CREATE TABLE Dept ( did INTEGER,
                    budget REAL,
                    managerid INTEGER,
                    PRIMARY KEY (did),
                    FOREIGN KEY (managerid) REFERENCES Emp ON DELETE SET NULL )
Now we can define statements to delete employees who make more than the manager of some department that they work in:

DELETE FROM Emp E
WHERE E.eid IN ( SELECT W.eid
                 FROM   Works W, Emp E2, Dept D
                 WHERE  W.did = D.did
                        AND D.managerid = E2.eid
                        AND E.salary > E2.salary )
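Both the CHECK constraint of part 1 and the ON DELETE CASCADE behavior assumed in part 4 can be observed in SQLite (a sketch; note that SQLite enforces foreign keys only after the PRAGMA shown, and the employee names and numbers are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks FKs only when asked
cur = conn.cursor()
cur.execute("""CREATE TABLE Emp ( eid INTEGER PRIMARY KEY, ename TEXT,
                                  age INTEGER, salary REAL,
                                  CHECK (salary >= 10000) )""")
cur.execute("""CREATE TABLE Works ( eid INTEGER REFERENCES Emp(eid)
                                        ON DELETE CASCADE,
                                    did INTEGER, pcttime INTEGER )""")
cur.execute("INSERT INTO Emp VALUES (1, 'Guldu', 31, 50000)")       # accepted
cur.execute("INSERT INTO Works VALUES (1, 7, 100)")
try:
    cur.execute("INSERT INTO Emp VALUES (2, 'Madayan', 22, 9000)")  # CHECK fails
except sqlite3.IntegrityError as e:
    print("rejected:", e)
cur.execute("DELETE FROM Emp WHERE eid = 1")   # cascades to Works
print(cur.execute("SELECT COUNT(*) FROM Works").fetchone()[0])  # 0
```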
Exercise 5.8 Consider the following relations: Student(snum: integer, sname: string, major: string,
level: string, age: integer) Class(name: string, meets at: time, room: string, ﬁd: integer) Enrolled(snum: integer, cname: string) Faculty(ﬁd: integer, fname: string, deptid: integer)
The meaning of these relations is straightforward; for example, Enrolled has one record per student-class pair such that the student is enrolled in the class. 1. Write the SQL statements required to create the above relations, including appropriate versions of all primary and foreign key integrity constraints. 2. Express each of the following integrity constraints in SQL unless it is implied by the primary and foreign key constraint; if so, explain how it is implied. If the constraint cannot be expressed in SQL, say so. For each constraint, state what operations (inserts, deletes, and updates on speciﬁc relations) must be monitored to enforce the constraint. (a) Every class has a minimum enrollment of 5 students and a maximum enrollment of 30 students. (b) At least one class meets in each room. (c) Every faculty member must teach at least two courses. (d) Only faculty in the department with deptid=33 teach more than three courses. (e) Every student must be enrolled in the course called Math101. (f) The room in which the earliest scheduled class (i.e., the class with the smallest meets at value) meets should not be the same as the room in which the latest scheduled class meets. (g) Two classes cannot meet in the same room at the same time. (h) The department with the most faculty members must have fewer than twice the number of faculty members in the department with the fewest faculty members. (i) No department can have more than 10 faculty members. (j) A student cannot add more than two courses at a time (i.e., in a single update). (k) The number of CS majors must be more than the number of Math majors. (l) The number of distinct courses in which CS majors are enrolled is greater than the number of distinct courses in which Math majors are enrolled. (m) The total enrollment in courses taught by faculty in the department with deptid=33 is greater than the number of Math majors. (n) There must be at least one CS major if there are any students whatsoever. 
(o) Faculty members from diﬀerent departments cannot teach in the same room.
Answer 5.8 Answer omitted.
Exercise 5.9 Discuss the strengths and weaknesses of the trigger mechanism. Contrast triggers with other integrity constraints supported by SQL. Answer 5.9 Answer not available yet.
Exercise 5.10 Consider the following relational schema. An employee can work in more than one department; the pct time field of the Works relation shows the percentage of time that a given employee works in a given department.

Emp(eid: integer, ename: string, age: integer, salary: real)
Works(eid: integer, did: integer, pct time: integer)
Dept(did: integer, budget: real, managerid: integer)

Write SQL-92 integrity constraints (domain, key, foreign key, or CHECK constraints; or assertions) or SQL:1999 triggers to ensure each of the following requirements, considered independently.
1. Employees must make a minimum salary of $1,000.
2. Every manager must also be an employee.
3. The total percentage of all appointments for an employee must be under 100%.
4. A manager must always have a higher salary than any employee that he or she manages.
5. Whenever an employee is given a raise, the manager's salary must be increased to be at least as much.
6. Whenever an employee is given a raise, the manager's salary must be increased to be at least as much. Further, whenever an employee is given a raise, the department's budget must be increased to be greater than the sum of salaries of all employees in the department.

Answer 5.10 Answer omitted.
6
QUERY-BY-EXAMPLE (QBE)
Exercise 6.1 Consider the following relational schema. An employee can work in more than one department.

Emp(eid: integer, ename: string, salary: real)
Works(eid: integer, did: integer)
Dept(did: integer, dname: string, managerid: integer, floornum: integer)

Write the following queries in QBE. Be sure to underline your variables to distinguish them from your constants.
1. Print the names of all employees who work on the 10th floor and make less than $50,000.
2. Print the names of all managers who manage three or more departments on the same floor.
3. Print the names of all managers who manage 10 or more departments on the same floor.
4. Give every employee who works in the toy department a 10 percent raise.
5. Print the names of the departments that employee Santa works in.
6. Print the names and salaries of employees who work in both the toy department and the candy department.
7. Print the names of employees who earn a salary that is either less than $10,000 or more than $100,000.
8. Print all of the attributes for employees who work in some department that employee Santa also works in.
9. Fire Santa.
10. Print the names of employees who make more than $20,000 and work in either the video department or the toy department.
11. Print the names of all employees who work on the floor(s) where Jane Dodecahedron works.
12. Print the name of each employee who earns more than the manager of the department that he or she works in.
13. Print the name of each department that has a manager whose last name is Psmith and who is neither the highest-paid nor the lowest-paid employee in the department.

Answer 6.1

1. Names of all employees who work on the 10th floor and make less than $50,000:

Emp   | eid | ename | salary
      | _E  | P.    | < 50000

Works | eid | did
      | _E  | _D

Dept  | did | dname | managerid | floornum
      | _D  |       |           | 10
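QBE answer 1 links Emp to Works through the variable _E and Works to Dept through _D; in SQL this becomes an equijoin. A sketch of the same query in SQLite, with a made-up instance (the employee names and department numbers are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Emp (eid INTEGER, ename TEXT, salary REAL)")
cur.execute("CREATE TABLE Works (eid INTEGER, did INTEGER)")
cur.execute("CREATE TABLE Dept (did INTEGER, dname TEXT, "
            "managerid INTEGER, floornum INTEGER)")
cur.executemany("INSERT INTO Emp VALUES (?, ?, ?)",
                [(1, "Smith", 40000), (2, "Jones", 60000)])
cur.executemany("INSERT INTO Works VALUES (?, ?)", [(1, 10), (2, 10)])
cur.execute("INSERT INTO Dept VALUES (10, 'Toy', 2, 10)")
# The shared QBE variables correspond to the join conditions below.
rows = cur.execute("""
    SELECT E.ename
    FROM   Emp E, Works W, Dept D
    WHERE  E.eid = W.eid AND W.did = D.did
           AND D.floornum = 10 AND E.salary < 50000
""").fetchall()
print(rows)  # [('Smith',)]: Jones earns too much to qualify
```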
2. Names of all managers who manage three or more departments on the same floor:

Emp  | eid | ename | salary
     | _M  | P._N  |

Dept | did | dname | managerid | floornum
     | _D1 |       | _M        | _F
     | _D2 |       | _M        | _F
     | _D3 |       | _M        | _F

Conditions
_D1 ¬= _D2 AND _D1 ¬= _D3 AND _D2 ¬= _D3

3. Names of all managers who manage ten or more departments on the same floor:

Emp  | eid | ename  | salary
     | _M  | P.G._N |

Dept | did | dname | managerid | floornum
     | _D  |       | G._M      | G.

Conditions
COUNT._D >= 10
4. Give every employee who works in the Toy department a 10% raise:

Emp   | eid | ename | salary
      | _E  |       | U._S*1.1

Works | eid | did
      | _E  | _D

Dept  | did | dname | managerid | floornum
      | _D  | Toy   |           |
5. Names of the departments that employee Santa works in:

Emp   | eid | ename | salary
      | _E  | Santa |

Works | eid | did
      | _E  | _D

Dept  | did | dname | managerid | floornum
      | _D  | P._N  |           |
6. Names and salaries of employees who work in both the Toy department and the Candy department:

Emp UNQ. | eid | ename | salary
         | _E  | P._EN | P._ES

Works | eid | did
      | _E  | _D1
      | _E  | _D2

Dept  | did | dname | managerid | floornum
      | _D1 | _DN1  |           |
      | _D2 | _DN2  |           |

Conditions
_DN1 = Toy AND _DN2 = Candy
7. Names of employees who earn a salary that is either less than $10,000 or more than $100,000:

Emp | eid | ename | salary
    |     | P._N  | _S

Conditions
_S < 10000 OR _S > 100000
8. All of the attributes of employees who work in some department that employee Santa also works in:

Emp | eid | ename | salary
    | _E  | Santa |
 P. | _E1 |       |

Works | eid | did
      | _E  | _D
      | _E1 | _D

Conditions
_E1 ¬= _E

9. Fire Santa:

Emp | eid | ename | salary
 D. | _E  | Santa |

Works | eid | did
 D.   | _E  |
10. Names of employees who make more than $20,000 and work in either the Video department or the Toy department:

Emp   | eid | ename | salary
      | _E  | P._EN | > 20000

Works | eid | did
      | _E  | _D

Dept  | did | dname | managerid | floornum
      | _D  | _DN   |           |

Conditions
_DN = Video OR _DN = Toy
11. Names of all employees who work on the floor(s) where Jane Dodecahedron works:

Emp   | eid | ename             | salary
      | _E1 | Jane Dodecahedron |
      | _E2 | P._N              |

Works | eid | did
      | _E1 | _D1
      | _E2 | _D2

Dept  | did | dname | managerid | floornum
      | _D1 |       |           | _F
      | _D2 |       |           | _F
12. Name of each employee who earns more than the manager of the department that he or she works in:

Emp   | eid | ename | salary
      | _M  |       | _S
      | _E  | P._N  | > _S

Works | eid | did
      | _E  | _D

Dept  | did | dname | managerid | floornum
      | _D  |       | _M        |
13. Name of each department that has a manager whose last name is Psmith and who is neither the highest-paid nor the lowest-paid employee in the department:

Works | eid | did
      | _E  | _D

Emp | eid | ename          | salary
    | _E  |                | _S
    | _M  | LIKE '%Psmith' | _S2

Temp | did  | Max    | Min
 I.  | G._D | MAX._S | MIN._S

Temp | did | Max | Min
     | _D2 | _B  | _A

Dept | did | dname | managerid | floornum
     | _D2 | P._DN | _M        |

Conditions
_S2 > _A AND _S2 < _B
Exercise 6.2 Write the following queries in QBE, based on this schema:
Suppliers(sid: integer, sname: string, city: string) Parts(pid: integer, pname: string, color: string) Orders(sid: integer, pid: integer, quantity: integer)
1. For each supplier from whom all of the following things have been ordered in quantities of at least 150, print the name and city of the supplier: a blue gear, a red crankshaft, and a yellow bumper. 2. Print the names of the purple parts that have been ordered from suppliers located in Madison, Milwaukee, or Waukesha. 3. Print the names and cities of suppliers who have an order for more than 150 units of a yellow or purple part. 4. Print the pids of parts that have been ordered from a supplier named American but have also been ordered from some supplier with a diﬀerent name in a quantity that is greater than the American order by at least 100 units. 5. Print the names of the suppliers located in Madison. Could there be any duplicates in the answer? 6. Print all available information about suppliers that supply green parts. 7. For each order of a red part, print the quantity and the name of the part. 8. Print the names of the parts that come in both blue and green. (Assume that no two distinct parts can have the same name and color.) 9. Print (in ascending order alphabetically) the names of parts supplied both by a Madison supplier and by a Berkeley supplier. 10. Print the names of parts supplied by a Madison supplier, but not supplied by any Berkeley supplier. Could there be any duplicates in the answer? 11. Print the total number of orders. 12. Print the largest quantity per order for each sid such that the minimum quantity per order for that supplier is greater than 100. 13. Print the average quantity per order of red parts. 14. Can you write this query in QBE? If so, how? Print the sids of suppliers from whom every part has been ordered. Answer 6.2 Answer omitted.
Exercise 6.3 Answer the following questions:
1. Describe the various uses for unnamed columns in QBE. 2. Describe the various uses for a conditions box in QBE. 3. What is unusual about the treatment of duplicates in QBE?
4. Is QBE based upon relational algebra, tuple relational calculus, or domain relational calculus? Explain briefly.
5. Is QBE relationally complete? Explain briefly.
6. What restrictions does QBE place on update commands?

Answer 6.3

1. If we want to display some information in addition to fields retrieved from a relation, we can do this by creating unnamed fields for display. For example:

Sailors | sid | sname | rating | age | (unnamed)
        |     | P.    | _R     | _A  | P. _R / _A

If we want to display fields from more than one table, we can use unnamed columns. For example:

Sailors  | sid | sname | (unnamed)
         | _Id | P.    | P._D

Reserves | sid | bid | date
         | _Id |     | _D

2. Conditions boxes are used to do the following:

Express a condition involving two or more columns, e.g., '_R / _A > 0.2'.

Express a condition involving an aggregate operation, similar to the HAVING clause in SQL. For example:

Suppliers | sid  | sname | city
          | P._S |       | G.P._C

Conditions
COUNT._S > 5

Express conditions involving the AND and OR operators. For example:

Suppliers | sid  | sname | city
          | P._S |       | _C

Conditions
_C = Madison OR _C = Milwaukee OR _C = Waukesha

3. The default treatment of duplicates in QBE is unusual. If the query contains a single row with P., the default is that duplicates are not eliminated. If the query contains more than one such row, duplicates are eliminated by default. In either case, you can explicitly specify whether duplicates are to be eliminated (or not) by putting ALL. (resp. UNQ.) under the relation name. For example:
Sailors UNQ. | sid | sname | rating | age
             |     | P._N  |        | P._A

In the following query, duplicates are eliminated by default, and the name of each sailor in this age range is printed once:

Sailors | sid | sname | rating | age
        |     | P._S  |        | < 30
        |     | P._S  |        | > 20

Putting ALL. in the first column (of one of the rows) results in printing the names of qualifying sailors as often as there are sailors with this name in the given age range.

4. QBE is based upon domain relational calculus. A user writes queries by creating example tables. QBE uses domain variables, as in the domain relational calculus (DRC), to create example tables. The domain of a variable is determined by the column in which it appears, and variable symbols are prefixed with '_' to distinguish them from constants. Constants, including strings, appear unquoted, in contrast to SQL. The fields that should appear in the answer are specified by using the command 'P.', which stands for 'print'. The fields containing this command are analogous to the target-list in the SELECT clause of an SQL query.

5. Yes. Some queries cannot be expressed in QBE without the use of aggregate operators, unless QBE makes use of the update commands to create a temporary relation or view. Therefore, taking the update commands into account, QBE is relationally complete, even without the aggregate operators. To understand the difficulty of expressing division in QBE, consider the following query: Find sailors who have reserved all boats.

Sailors  | sid | sname | rating | age
         | _Id |       |        |

Boats    | bid | bname | color
         | _B  |       |

Reserves | sid | bid | date
      ¬  | _Id | _B  |

BadSids  | sid
     I.  | _Id
Given the view BadSids, it is a simple matter to ﬁnd sailors whose sid’s are not in this view. The ideas in this example can be extended to show that QBE is relationally complete.
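The BadSids idea translates directly into SQL. The sketch below (SQLite, with invented sailors and boats) materializes BadSids as a view and then takes its complement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT)")
cur.execute("CREATE TABLE Boats (bid INTEGER)")
cur.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER)")
cur.executemany("INSERT INTO Sailors VALUES (?, ?)", [(1, "ahab"), (2, "moby")])
cur.executemany("INSERT INTO Boats VALUES (?)", [(101,), (102,)])
cur.executemany("INSERT INTO Reserves VALUES (?, ?)",
                [(1, 101), (1, 102), (2, 101)])
# BadSids: sailors for whom some boat exists that they have NOT reserved.
cur.execute("""
    CREATE VIEW BadSids AS
    SELECT S.sid
    FROM   Sailors S, Boats B
    WHERE  NOT EXISTS ( SELECT * FROM Reserves R
                        WHERE  R.sid = S.sid AND R.bid = B.bid )
""")
# Sailors who reserved all boats are exactly those not in BadSids.
rows = cur.execute("""SELECT sname FROM Sailors
                      WHERE sid NOT IN (SELECT sid FROM BadSids)""").fetchall()
print(rows)  # [('ahab',)]: ahab reserved every boat, moby missed boat 102
```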
6. There are some restrictions on the use of the I., D., and U. commands. First, we cannot mix these operators in a single example table (or combine them with P.). Second, we cannot specify I., D., or U. in an example table that contains G. Third, we cannot insert, delete, or update tuples based upon values in fields of other tuples (in the same table or in different tables). Thus, the following update is incorrect:

Sailors | sid | sname | rating | age
        |     | john  |        | U._A+1
        |     | joe   |        | _A
7
STORING DATA: DISKS AND FILES
Exercise 7.1 What is the most important diﬀerence between a disk and a tape? Answer 7.1 Tapes are sequential devices that do not support direct access to a desired page. We must essentially step through all pages in order. Disks support direct access to a desired page. Exercise 7.2 Explain the terms seek time, rotational delay, and transfer time. Answer 7.2 Answer omitted.
Exercise 7.3 Both disks and main memory support direct access to any desired location (page). On average, main memory accesses are faster, of course. What is the other important difference (from the perspective of the time required to access a desired page)? Answer 7.3 The time to access a disk page is not constant. It depends on the location of the data: access to some data may be much faster than access to other data. In contrast, the time to access any location in main memory is uniform on most computer systems.
Exercise 7.4 If you have a large file that is frequently scanned sequentially, explain how you would store the pages in the file on a disk. Answer 7.4 Answer omitted. Exercise 7.5 Consider a disk with a sector size of 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and an average seek time of 10 msec.
1. What is the capacity of a track in bytes? What is the capacity of each surface? What is the capacity of the disk?
2. How many cylinders does the disk have?
3. Give examples of valid block sizes. Is 256 bytes a valid block size? 2,048? 51,200?
4. If the disk platters rotate at 5,400 rpm (revolutions per minute), what is the maximum rotational delay?
5. Assuming that one track of data can be transferred per revolution, what is the transfer rate?

Answer 7.5
1. bytes/track = bytes/sector × sectors/track = 512 × 50 = 25K
   bytes/surface = bytes/track × tracks/surface = 25K × 2,000 = 50,000K
   bytes/disk = bytes/surface × surfaces/disk = 50,000K × 10 = 500,000K
2. The number of cylinders is the same as the number of tracks on each platter, which is 2,000.
3. The block size should be a multiple of the sector size. Thus 256 bytes is not a valid block size, while 2,048 and 51,200 are.
4. If the disk platters rotate at 5,400 rpm, the time required for one rotation, which is the maximum rotational delay, is (1/5,400) × 60 = 0.011 seconds. The average rotational delay is half of the rotation time, 0.006 seconds.
5. The capacity of a track is 25K bytes. Since one track of data can be transferred per revolution, the data transfer rate is 25K / 0.011 = 2,250 Kbytes per second.

Exercise 7.6 Consider again the disk specifications from Exercise 7.5 and suppose that a block size of 1,024 bytes is chosen. Suppose that a file containing 100,000 records of 100 bytes each is to be stored on such a disk and that no record is allowed to span two blocks.
1. How many records fit onto a block?
2. How many blocks are required to store the entire file? If the file is arranged sequentially on disk, how many surfaces are needed?
3. How many records of 100 bytes each can be stored using this disk?
4. If pages are stored sequentially on disk, with page 1 on block 1 of track 1, what is the page stored on block 1 of track 1 on the next disk surface? How would your answer change if the disk were capable of reading/writing from all heads in parallel? 5. What is the time required to read a ﬁle containing 100,000 records of 100 bytes each sequentially? Again, how would your answer change if the disk were capable of reading/writing from all heads in parallel (and the data was arranged optimally)? 6. What is the time required to read a ﬁle containing 100,000 records of 100 bytes each in some random order? Note that in order to read a record, the block containing the record has to be fetched from disk. Assume that each block request incurs the average seek time and rotational delay. Answer 7.6 Answer omitted.
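The arithmetic in Answer 7.5 above is easy to double-check in a few lines of Python (here K = 1,024 bytes, which is what makes 512 × 50 come out to exactly 25K):

```python
K = 1024
bytes_per_track = 512 * 50                  # bytes/sector * sectors/track
bytes_per_surface = bytes_per_track * 2000  # * tracks/surface
bytes_per_disk = bytes_per_surface * 10     # * surfaces (5 double-sided platters)
print(bytes_per_track // K, bytes_per_surface // K, bytes_per_disk // K)
# 25 50000 500000  (i.e. 25K, 50,000K, 500,000K)

rotation_time = 60 / 5400                   # seconds per revolution at 5,400 rpm
print(round(rotation_time, 3))              # 0.011: the maximum rotational delay

transfer_rate = bytes_per_track / rotation_time  # one track per revolution
print(round(transfer_rate / K))             # 2250 Kbytes per second
```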
Exercise 7.7 Explain what the buffer manager must do to process a read request for a page. What happens if the requested page is in the pool but not pinned? Answer 7.7 When a page is requested, the buffer manager does the following: 1. The buffer pool is checked to see if it contains the requested page. If the page is not in the pool, it is brought in as follows: (a) A frame is chosen for replacement, using the replacement policy. (b) If the frame chosen for replacement is dirty, it is flushed (the page it contains is written out to disk). (c) The requested page is read into the frame chosen for replacement. 2. The requested page is pinned (the pin count of its frame is incremented) and its address is returned to the requestor. Note that if the page is in the pool but not pinned, it could be chosen for replacement and removed from the buffer pool even if it is actually needed in main memory.
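The steps above can be sketched as a toy buffer manager. Everything here is invented for illustration: the fake DISK dictionary, the frame layout, and the first-unpinned replacement policy standing in for a real policy such as LRU:

```python
from collections import OrderedDict

DISK = {p: f"data-{p}" for p in range(100)}   # fake disk: page id -> contents

class BufferManager:
    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = OrderedDict()   # page id -> [contents, pin_count, dirty]

    def get_page(self, pid):
        if pid not in self.frames:            # step 1: page not in the pool
            if len(self.frames) >= self.num_frames:
                self._evict()                 # step 1(a): choose a victim frame
            self.frames[pid] = [DISK[pid], 0, False]   # step 1(c): read it in
        self.frames[pid][1] += 1              # step 2: pin it and return it
        return self.frames[pid][0]

    def release_page(self, pid, dirty=False):
        frame = self.frames[pid]
        frame[1] -= 1                         # unpin: the requestor is done
        frame[2] = frame[2] or dirty

    def _evict(self):
        for pid, (data, pins, dirty) in self.frames.items():
            if pins == 0:                     # only unpinned pages are replaceable
                if dirty:
                    DISK[pid] = data          # step 1(b): flush the dirty page
                del self.frames[pid]
                return
        raise RuntimeError("all pages pinned")  # the situation of Answer 7.11

bm = BufferManager(num_frames=2)
bm.get_page(1)
bm.get_page(2)
bm.release_page(1, dirty=True)
bm.get_page(3)                 # pool is full: page 1 is evicted (flushed first)
print(sorted(bm.frames))       # [2, 3]
```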
Exercise 7.8 When does a buﬀer manager write a page to disk? Answer 7.8 Answer omitted.
Exercise 7.9 What does it mean to say that a page is pinned in the buﬀer pool? Who is responsible for pinning pages? Who is responsible for unpinning pages?
Answer 7.9 1. Pinning a page means the pin count of its frame is incremented. Pinning a page guarantees higher-level DBMS software that the page will not be removed from the buﬀer pool by the buﬀer manager. That is, another ﬁle page will not be read into the frame containing this page until it is unpinned by this requestor. 2. It is the buﬀer manager’s responsibility to pin a page. 3. It is the responsibility of the requestor of that page to tell the buﬀer manager to unpin a page.
Exercise 7.10 When a page in the buﬀer pool is modiﬁed, how does the DBMS ensure that this change is propagated to disk? (Explain the role of the buﬀer manager as well as the modiﬁer of the page.) Answer 7.10 Answer omitted.
Exercise 7.11 What happens if there is a page request when all pages in the buﬀer pool are dirty? Answer 7.11 If there are some unpinned pages, the buﬀer manager chooses one by using a replacement policy, ﬂushes this page, and then replaces it with the requested page. If there are no unpinned pages, the buﬀer manager has to wait until an unpinned page is available (or signal an error condition to the page requestor).
Exercise 7.12 What is sequential ﬂooding of the buﬀer pool? Answer 7.12 Answer omitted.
Exercise 7.13 Name an important capability of a DBMS buﬀer manager that is not supported by a typical operating system’s buﬀer manager. Answer 7.13 1. Pinning a page to prevent it from being replaced.
2. Ability to explicitly force a single page to disk.
Exercise 7.14 Explain the term prefetching. Why is it important? Answer 7.14 Answer omitted.
Exercise 7.15 Modern disks often have their own main memory caches, typically about one MB, and use this to do prefetching of pages. The rationale for this technique is the empirical observation that if a disk page is requested by some (not necessarily database!) application, 80 percent of the time the next page is requested as well. So the disk gambles by reading ahead.
1. Give a nontechnical reason that a DBMS may not want to rely on prefetching controlled by the disk.
2. Explain the impact on the disk's cache of several queries running concurrently, each scanning a different file.
3. Can the above problem be addressed by the DBMS buffer manager doing its own prefetching? Explain.
4. Modern disks support segmented caches, with about four to six segments, each of which is used to cache pages from a different file. Does this technique help, with respect to the above problem? Given this technique, does it matter whether the DBMS buffer manager also does prefetching?

Answer 7.15
1. The prefetching done at the disk level varies widely across different drives and manufacturers, and prefetching is sufficiently important to a DBMS that one would like to be independent of specific hardware support.
2. If many queries run concurrently, their page requests can be interleaved. In the worst case, this causes a cache miss on every page request, even with disk prefetching.
3. If the prefetching is done by the DBMS buffer manager instead, the buffer manager can predict the reference pattern more accurately. In particular, a certain number of buffer frames can be allocated per active scan for prefetching purposes, and interleaved requests would not compete for the same frames.
Exercise 7.16 Describe two possible record formats. What are the trade-oﬀs between them? Answer 7.16 Answer omitted.
Exercise 7.17 Describe two possible page formats. What are the trade-offs between them?

Answer 7.17 Two possible page formats are consecutive slots and a slot directory. The consecutive-slots organization is mostly used for fixed-length record formats. It handles deletion by using bitmaps or linked lists.
The slot-directory organization maintains a directory of slots for each page, with a ⟨record offset, record length⟩ pair per slot. The slot directory is an indirect way to get the offset of an entry. Because of this indirection, deletion is easy: it is accomplished by setting the length field to 0. Records can also easily be moved around on the page without changing their external identifier.
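A tiny model makes the slot-directory trade-off concrete. The class below is purely illustrative (real pages pack the directory and records into one fixed-size byte array); it shows how a record id, here simply the slot number, survives deletion of other records and how deletion is just zeroing a length.

```python
# Illustrative slot-directory page: each slot holds a
# (record offset, record length) pair; deleting a record sets its
# length to 0, and record ids (slot numbers) stay stable.
class SlottedPage:
    def __init__(self):
        self.data = bytearray()              # record storage area
        self.slots = []                      # slot -> (offset, length)

    def insert(self, record: bytes) -> int:
        offset = len(self.data)
        self.data += record
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1           # record id = slot number

    def get(self, slot: int):
        offset, length = self.slots[slot]
        if length == 0:
            return None                      # deleted
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int):
        offset, _ = self.slots[slot]
        self.slots[slot] = (offset, 0)       # deletion = zero length
```

Because readers always go through the directory, a compaction pass could move the remaining records and update only the offsets, without invalidating any record id.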
Exercise 7.18 Consider the page format for variable-length records that uses a slot directory. 1. One approach to managing the slot directory is to use a maximum size (i.e., a maximum number of slots) and to allocate the directory array when the page is created. Discuss the pros and cons of this approach with respect to the approach discussed in the text. 2. Suggest a modiﬁcation to this page format that would allow us to sort records (according to the value in some ﬁeld) without moving records and without changing the record ids. Answer 7.18 Answer omitted.
Exercise 7.19 Consider the two internal organizations for heap files (using lists of pages and a directory of pages) discussed in the text.
1. Describe them briefly and explain the trade-offs. Which organization would you choose if records are variable in length?
2. Can you suggest a single page format to implement both internal file organizations?

Answer 7.19
1. The list of pages is shown in Fig 3.7. The directory of pages is shown in Fig 3.8.
2. The linked-list approach is a little simpler, but finding a page with sufficient free space for a new record (especially with variable-length records) is harder. We essentially have to scan the list of pages until we find one with enough space, whereas the directory organization allows us to find such a page by simply scanning the directory, which is much smaller than the entire file. The directory organization is therefore better, especially with variable-length records.
3. A page format with previous and next page pointers would help in both cases. Obviously, such a page format allows us to build the linked-list organization; it is also useful for implementing the directory in the directory organization.
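The free-space search that Answer 7.19 contrasts can be sketched as two loops; the difference is what each iteration costs. All structures and names here are illustrative: in the linked-list case every iteration is a data-page I/O, while the directory packs one small entry per page onto a few directory pages.

```python
# Linked list: chase next-pointers, examining whole data pages one by one.
def find_space_linked_list(pages, needed):
    """pages: free-byte counts of the data pages, in link order."""
    for page_id, free_bytes in enumerate(pages):   # one page I/O per step
        if free_bytes >= needed:
            return page_id
    return None                                    # no room: allocate a new page

# Directory: scan a compact (page_id, free_bytes) table instead.
def find_space_directory(directory, needed):
    """directory: (page_id, free_bytes) entries packed on a few pages."""
    for page_id, free_bytes in directory:
        if free_bytes >= needed:
            return page_id
    return None
```

Both return the same page, but with variable-length records the directory scan touches far fewer pages, which is the point of the answer above.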
Exercise 7.20 Consider a list-based organization of the pages in a heap ﬁle in which two lists are maintained: a list of all pages in the ﬁle and a list of all pages with free space. In contrast, the list-based organization discussed in the text maintains a list of full pages and a list of pages with free space. 1. What are the trade-oﬀs, if any? Is one of them clearly superior? 2. For each of these organizations, describe a page format that can be used to implement it. Answer 7.20 Answer omitted.
Exercise 7.21 Modern disk drives store more sectors on the outer tracks than the inner tracks. Since the rotation speed is constant, the sequential data transfer rate is also higher on the outer tracks. The seek time and rotational delay are unchanged. Considering this information, explain good strategies for placing ﬁles with the following kinds of access patterns: 1. Frequent, random accesses to a small ﬁle (e.g., catalog relations). 2. Sequential scans of a large ﬁle (e.g., selection from a relation with no index). 3. Random accesses to a large ﬁle via an index (e.g., selection from a relation via the index). 4. Sequential scans of a small ﬁle. Answer 7.21 1. Place the ﬁle in the middle tracks. Sequential speed is not an issue due to the small size of the ﬁle, and the seek time is minimized by placing ﬁles in the center. 2. Place the ﬁle in the outer tracks. Sequential speed is most important and outer tracks maximize it. 3. Place the ﬁle and index on the inner tracks. The DBMS will alternately access pages of the index and of the ﬁle, and so the two should reside in close proximity to reduce seek times. By placing the ﬁle and the index on the inner tracks we also save valuable space on the faster (outer) tracks for other ﬁles that are accessed sequentially. 4. Place small ﬁles in the inner half of the disk. A scan of a small ﬁle is eﬀectively random I/O because the cost is dominated by the cost of the initial seek to the beginning of the ﬁle.
8
FILE ORGANIZATIONS AND INDEXES
Exercise 8.1 What are the main conclusions that you can draw from the discussion of the three file organizations?

Answer 8.1 The main conclusion about the three file organizations is that each has its own advantages and disadvantages; no one file organization is uniformly superior in all situations. The choice of appropriate structures for a given data set can have a significant impact on performance. An unordered file is best if only full file scans are desired. A hashed file is best if the most common operation is an equality selection. A sorted file is best (of the three alternatives considered in this chapter) if range selections are desired.

Exercise 8.2 Consider a delete specified using an equality condition. What is the cost if no record qualifies? What is the cost if the condition is not on a key?

Answer 8.2 Answer omitted.

Exercise 8.3 Which of the three basic file organizations would you choose for a file where the most frequent operations are as follows?
1. Search for records based on a range of field values.
2. Perform inserts and scans where the order of records does not matter.
3. Search for a record based on a particular field value.

Answer 8.3
1. Using these fields as the search key, we would choose a sorted file organization.
2. A heap file would be the best fit in this situation.
3. Using this particular field as the search key, a hashed file would be the best choice.
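The qualitative ranking in Answer 8.1 can be made concrete with the chapter's customary I/O approximations for a file of B pages. The constants below (0.5·B average for an equality search in a heap file, log2 B for a sorted file, roughly one I/O for a hashed probe, 1.25·B for scanning an 80%-full hashed file) are conventional estimates, not figures taken from this manual.

```python
import math

# Rough per-operation I/O costs for a file of B pages, using the
# chapter's usual approximations (constants are estimates, not exact).
def io_costs(B):
    return {
        "heap":   {"scan": B,        "equality": 0.5 * B},
        "sorted": {"scan": B,        "equality": math.log2(B)},   # plus matches, for ranges
        "hashed": {"scan": 1.25 * B, "equality": 1.0},            # assumes no overflow chains
    }

costs = io_costs(1024)    # a 1024-page file, for illustration
# heap equality: 512 I/Os; sorted: 10; hashed: about 1;
# yet the hashed file is the worst of the three at full scans.
```

The numbers show exactly the conclusion of Answer 8.1: each organization wins for one operation and loses for another.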
Exercise 8.4 Explain the diﬀerence between each of the following: 1. Primary versus secondary indexes. 2. Dense versus sparse indexes. 3. Clustered versus unclustered indexes.
If you were about to create an index on a relation, what considerations would guide your choice with respect to each pair of properties listed above? Answer 8.4 Answer omitted.
Exercise 8.5 Consider a relation stored as a randomly ordered file for which the only index is an unclustered index on a field called sal. If you want to retrieve all records with sal > 20, is using the index always the best alternative? Explain.

Answer 8.5 No. Since the index is unclustered, each qualifying data entry could contain an rid that points to a distinct data page, leading to as many data page I/Os as there are data entries matching the range query. In that case, using the index is worse than a file scan.
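Answer 8.5 amounts to a break-even computation: beyond some selectivity, one page I/O per qualifying rid exceeds the cost of scanning the whole file. The numbers below are made up for illustration.

```python
# Back-of-envelope model of the unclustered-index trade-off.
NUM_PAGES = 1000
RECORDS_PER_PAGE = 100

def unclustered_index_cost(matching_records):
    return matching_records        # worst case: one data-page I/O per qualifying rid

def file_scan_cost():
    return NUM_PAGES

def best_plan(selectivity):
    matches = int(NUM_PAGES * RECORDS_PER_PAGE * selectivity)
    if unclustered_index_cost(matches) < file_scan_cost():
        return "index"
    return "file scan"
```

With these numbers, the index wins only when fewer than one record per page qualifies, i.e. selectivity below about 1 percent.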
Exercise 8.6 If an index contains data records as ‘data entries’, is it clustered or unclustered? Dense or sparse? Answer 8.6 Answer omitted.
Exercise 8.7 Consider Alternatives (1), (2) and (3) for 'data entries' in an index, as discussed in Section 8.3.1. Are they all suitable for secondary indexes? Explain.

Answer 8.7 Yes. All three alternatives allow duplicate data entries.
Exercise 8.8 Consider the instance of the Students relation shown in Figure 8.1, sorted by age. For the purposes of this question, assume that these tuples are stored in a sorted file in the order shown; the first tuple is in page 1, slot 1; the second tuple is in page 1, slot 2; and so on. Each page can store up to three data records. You can use ⟨page-id, slot⟩ to identify a tuple. List the data entries in each of the following indexes. If the order of entries is significant, say so and explain why. If such an index cannot be constructed, say so and explain why.
1. A dense index on age using Alternative (1).
sid     name     login           age   gpa
53831   Madayan  madayan@music   11    1.8
53832   Guldu    guldu@music     12    2.0
53666   Jones    jones@cs        18    3.4
53688   Smith    smith@ee        19    3.2
53650   Smith    smith@math      19    3.8

Figure 8.1   An Instance of the Students Relation, Sorted by age
2. A dense index on age using Alternative (2). 3. A dense index on age using Alternative (3). 4. A sparse index on age using Alternative (1). 5. A sparse index on age using Alternative (2). 6. A sparse index on age using Alternative (3). 7. A dense index on gpa using Alternative (1). 8. A dense index on gpa using Alternative (2). 9. A dense index on gpa using Alternative (3). 10. A sparse index on gpa using Alternative (1). 11. A sparse index on gpa using Alternative (2). 12. A sparse index on gpa using Alternative (3). Answer 8.8 Answer omitted.
9
TREE-STRUCTURED INDEXING
Exercise 9.1 Consider the B+ tree index of order d = 2 shown in Figure 9.1. 1. Show the tree that would result from inserting a data entry with key 9 into this tree. 2. Show the B+ tree that would result from inserting a data entry with key 3 into the original tree. How many page reads and page writes will the insertion require? 3. Show the B+ tree that would result from deleting the data entry with key 8 from the original tree, assuming that the left sibling is checked for possible redistribution. 4. Show the B+ tree that would result from deleting the data entry with key 8 from the original tree, assuming that the right sibling is checked for possible redistribution. 5. Show the B+ tree that would result from starting with the original tree, inserting a data entry with key 46 and then deleting the data entry with key 52. 6. Show the B+ tree that would result from deleting the data entry with key 91 from the original tree.
Root: [50]
Internal: [8 | 18 | 32 | 40]   [73 | 85]
Leaves: [1* 2* 5* 6*]  [8* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80*]  [91* 99*]

Figure 9.1   Tree for Exercise 9.1
Root: [50]
Internal: [8 | 18 | 32 | 40]   [73 | 85]
Leaves: [1* 2* 5* 6*]  [8* 9* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80*]  [91* 99*]

Figure 9.2
Root: [18 | 50]
Internal: [5 | 8]   [32 | 40]   [73 | 85]
Leaves: [1* 2* 3*]  [5* 6*]  [8* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80*]  [91* 99*]

Figure 9.3
7. Show the B+ tree that would result from starting with the original tree, inserting a data entry with key 59, and then deleting the data entry with key 91.
8. Show the B+ tree that would result from successively deleting the data entries with keys 32, 39, 41, 45, and 73 from the original tree.

Answer 9.1
1. The data entry with key 9 is inserted on the second leaf page. The resulting tree is shown in Figure 9.2.
2. The data entry with key 3 goes on the first leaf page F. Since F can accommodate at most four data entries (d = 2), F splits. The lowest data entry of the new leaf is given up to the ancestor, which also splits. The result can be seen in Figure 9.3. The insertion will require 6 page writes, 4 page reads and allocation of 2 new pages.
3. The data entry with key 8 is deleted, resulting in a leaf page N with fewer than two data entries. The left sibling L is checked for redistribution. Since L has more than two data entries, the remaining keys are redistributed between L and N, resulting in the tree in Figure 9.4.
4. As in part 3, the data entry with key 8 is deleted from the leaf page N. N's right sibling R is checked for redistribution, but R has the minimum number of keys.
Root: [50]
Internal: [6 | 18 | 32 | 40]   [73 | 85]
Leaves: [1* 2* 5*]  [6* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80*]  [91* 99*]

Figure 9.4
Root: [50]
Internal: [8 | 32 | 40]   [73 | 85]
Leaves: [1* 2* 5* 6*]  [10* 18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80*]  [91* 99*]

Figure 9.5
Therefore the two siblings merge. The key in the ancestor that distinguished between the newly merged leaves is deleted. The resulting tree is shown in Figure 9.5.
5. The data entry with key 46 can be inserted without any structural changes in the tree. But the removal of the data entry with key 52 causes its leaf page L to merge with a sibling (we chose the right sibling). This results in the removal of a key in the ancestor A of L and thereby lowers the number of keys on A below the minimum number of keys. Since the left sibling B of A has more than the minimum number of keys, redistribution between A and B takes place. The final tree is depicted in Figure 9.6.
6. Deleting the data entry with key 91 causes a scenario similar to part 5. The result can be seen in Figure 9.7.
7. The data entry with key 59 can be inserted without any structural changes in the tree. No sibling of the leaf page containing the data entry with key 91 is affected by the insert. Therefore deleting the data entry with key 91 changes the tree in a way very similar to part 6. The result is depicted in Figure 9.8.
8. The successive deletion of the data entries with keys 32, 39, 41, 45 and 73 results in the tree in Figure 9.9.
Root: [40]
Internal: [8 | 18 | 32]   [50 | 85]
Leaves: [1* 2* 5* 6*]  [8* 10*]  [18* 27*]  [32* 39*]  [41* 45* 46*]  [58* 73* 80*]  [91* 99*]

Figure 9.6
Root: [40]
Internal: [8 | 18 | 32]   [50 | 73]
Leaves: [1* 2* 5* 6*]  [8* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58*]  [73* 80* 99*]

Figure 9.7
Root: [40]
Internal: [8 | 18 | 32]   [50 | 73]
Leaves: [1* 2* 5* 6*]  [8* 10*]  [18* 27*]  [32* 39*]  [41* 45*]  [52* 58* 59*]  [73* 80* 99*]

Figure 9.8
Root: [8 | 18 | 50 | 73]
Leaves: [1* 2* 5* 6*]  [8* 10*]  [18* 27*]  [52* 58*]  [80* 91* 99*]

Figure 9.9
Root I1: [10 | 20 | 30 | 80], children: A, B, C, I2, I3
I2: [35 | 42 | 50 | 65], children: L1, L2, L3, L4, L5
I3: [90 | 98], children: L6, L7, L8
L1: [30* 31*]   L2: [36* 38*]   L3: [42* 43*]   L4: [51* 52* 56* 60*]   L5: [68* 69* 70* 79*]
L6: [81* 82*]   L7: [94* 95* 96* 97*]   L8: [98* 99* 100* 105*]

Figure 9.10   Tree for Exercise 9.2
Exercise 9.2 Consider the B+ tree index shown in Figure 9.10, which uses Alternative (1) for data entries. Each intermediate node can hold up to ﬁve pointers and four key values. Each leaf can hold up to four records, and leaf nodes are doubly linked as usual, although these links are not shown in the ﬁgure. Answer the following questions. 1. Name all the tree nodes that must be fetched to answer the following query: “Get all records with search key greater than 38.” 2. Insert a record with search key 109 into the tree. 3. Delete the record with search key 81 from the (original) tree. 4. Name a search key value such that inserting it into the (original) tree would cause an increase in the height of the tree.
5. Note that subtrees A, B, and C are not fully speciﬁed. Nonetheless, what can you infer about the contents and the shape of these trees? 6. How would your answers to the above questions change if this were an ISAM index? 7. Suppose that this is an ISAM index. What is the minimum number of insertions needed to create a chain of three overﬂow pages? Answer 9.2 Answer omitted.
Exercise 9.3 Answer the following questions.
1. What is the minimum space utilization for a B+ tree index?
2. What is the minimum space utilization for an ISAM index?
3. If your database system supported both a static and a dynamic tree index (say, ISAM and B+ trees), would you ever consider using the static index in preference to the dynamic index?

Answer 9.3
1. By the definition of a B+ tree, each index page, except for the root, has at least d and at most 2d key entries. Therefore, with the exception of the root, the minimum space utilization guaranteed by a B+ tree index is 50 percent.
2. The minimum space utilization of an ISAM index depends on the design of the index and the data distribution over the lifetime of the ISAM index. But since an ISAM index is static, empty spaces in index pages are never filled (in contrast to a B+ tree index, which is dynamic). Therefore the space utilization of ISAM index pages is usually close to 100 percent. However, there is no guarantee for the utilization of the data pages.
3. A static index without overflow pages is faster than a dynamic index on inserts and deletes, since index pages are only read and never written. If the set of keys that will be inserted into the tree is known in advance, it is possible to build at the outset a static index that reserves enough space for all possible future inserts. Also, if the system periodically goes offline, static indexes can be rebuilt and scaled to the current occupancy of the index.
Exercise 9.4 Suppose that a page can contain at most four data values and that all data values are integers. Using only B+ trees of order 2, give examples of each of the following: 1. A B+ tree whose height changes from 2 to 3 when the value 25 is inserted. Show your structure before and after the insertion.
Root: [13 | 17 | 24 | 30]
Leaves: [2* 3* 5* 7*]  [14* 16*]  [19* 20* 22*]  [24* 27* 29*]  [33* 34* 38* 39*]

Figure 9.11   Tree for Exercise 9.5
2. A B+ tree in which the deletion of the value 25 leads to a redistribution. Show your structure before and after the deletion. 3. A B+ tree in which the deletion of the value 25 causes a merge of two nodes, but without altering the height of the tree. 4. An ISAM structure with four buckets, none of which has an overﬂow page. Further, every bucket has space for exactly one more entry. Show your structure before and after inserting two additional values, chosen so that an overﬂow page is created. Answer 9.4 Answer omitted.
Exercise 9.5 Consider the B+ tree shown in Figure 9.11. 1. Identify a list of ﬁve data entries such that: (a) Inserting the entries in the order shown and then deleting them in the opposite order (e.g., insert a, insert b, delete b, delete a) results in the original tree. (b) Inserting the entries in the order shown and then deleting them in the opposite order (e.g., insert a, insert b, delete b, delete a) results in a diﬀerent tree. 2. What is the minimum number of insertions of data entries with distinct keys that will cause the height of the (original) tree to change from its current value (of 1) to 3? 3. Would the minimum number of insertions that will cause the original tree to increase to height 3 change if you were allowed to insert duplicates (multiple data entries with the same key), assuming that overﬂow pages are not used for handling duplicates?
Answer 9.5
1. (a) One example is the set of data entries with keys 13, 15, 18 and 25.
(b) An insertion of the data entry with key 30 and its removal results in a different tree.
2. Let us call the current tree depicted in Figure 9.11 T. T has 16 data entries. The smallest tree S of height 3 that is created exclusively through inserts has (1 ∗ 2 ∗ 3 ∗ 3) ∗ 2 + 1 = 37 data entries in its leaf pages. S has 17 leaf pages with two data entries each and one leaf page with three data entries. T already has four leaf pages with more than two data entries; they can be filled and made to split, but after each split, one of the two resulting pages still has three data entries. Therefore the smallest tree of height 3 that can possibly be created from T through inserts alone has (1 ∗ 2 ∗ 3 ∗ 3) ∗ 2 + 4 = 40 data entries, and the minimum number of insertions that will cause the height of T to change to 3 is 40 − 16 = 24.
3. The argument in part 2 does not assume anything about the data entries to be inserted; it is valid if duplicates can be inserted as well. Therefore the solution does not change.
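The counting argument of Answer 9.5, part 2, reduces to a few lines of arithmetic (order d = 2: the root of a height-3 tree has at least 2 children, every other internal node at least 3, and every leaf at least d entries):

```python
# Minimum-insertion count for growing the Figure 9.11 tree to height 3.
d = 2
min_leaves = 1 * 2 * 3 * 3                       # 18 leaves in the smallest height-3 tree
smallest_fresh = min_leaves * d + 1              # 37: built by inserts from scratch
smallest_from_T = min_leaves * d + 4             # 40: T already has 4 over-full leaves
entries_in_T = 16                                # data entries in the tree of Figure 9.11
min_insertions = smallest_from_T - entries_in_T  # 24
```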
Exercise 9.6 Answer Exercise 9.5 assuming that the tree is an ISAM tree! (Some of the examples asked for may not exist—if so, explain brieﬂy.) Answer 9.6 Answer omitted.
Exercise 9.7 Suppose that you have a sorted file, and you want to construct a dense primary B+ tree index on this file.
1. One way to accomplish this task is to scan the file, record by record, inserting each one using the B+ tree insertion procedure. What performance and storage utilization problems are there with this approach?
2. Explain how the bulk-loading algorithm described in the text improves upon the above scheme.

Answer 9.7
1. This approach is likely to be quite expensive, since each entry requires us to start from the root and go down to the appropriate leaf page. Even though the index-level pages are likely to stay in the buffer pool between successive requests, the overhead is still considerable. Moreover, according to the insertion algorithm, each time a node splits, its data entries are redistributed evenly between the two nodes, leading to a fixed occupancy of about 50 percent.
2. The bulk-loading algorithm has better performance and space utilization than the insertion approach. First, when a node fills and must split, we can leave all the entries on the old node, or fill nodes up to some desired fraction, say 80 percent, which leads to high space utilization. Second, splits occur only on the right-most path from the root to the leaf level; these are just a few pages, and we can keep them in the buffer pool, so performance is much better than with the insertion approach.
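A toy version of the bottom-up idea in Answer 9.7 can be written in a dozen lines. Nodes here are plain Python lists and the fanout/fill numbers are illustrative; this sketches the technique under simplifying assumptions, not the book's exact algorithm.

```python
# Toy bottom-up bulk load: sort the data entries, pack leaves to a chosen
# fill fraction, then repeatedly group nodes to build each index level.
def bulk_load(entries, fanout=4, fill=0.75):
    data = sorted(entries)
    per_node = max(2, int(fanout * fill))        # e.g., fill nodes to 75%
    level = [data[i:i + per_node] for i in range(0, len(data), per_node)]
    levels = [level]                             # leaf level first
    while len(level) > 1:                        # build index levels bottom-up
        level = [level[i:i + per_node] for i in range(0, len(level), per_node)]
        levels.append(level)
    return levels                                # levels[-1] holds the single root
```

Loading the keys 1 through 20 yields seven leaves of up to three entries and two index levels above them; only the right-most node of each level is ever "open", which is why the pages involved in splits can be kept pinned in the buffer pool.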
Exercise 9.8 Assume that you have just built a dense B+ tree index using Alternative (2) on a heap ﬁle containing 20,000 records. The key ﬁeld for this B+ tree index is a 40-byte string, and it is a candidate key. Pointers (i.e., record ids and page ids) are (at most) 10-byte values. The size of one disk page is 1,000 bytes. The index was built in a bottom-up fashion using the bulk-loading algorithm, and the nodes at each level were ﬁlled up as much as possible. 1. How many levels does the resulting tree have? 2. For each level of the tree, how many nodes are at that level? 3. How many levels would the resulting tree have if key compression is used and it reduces the average size of each key in an entry to 10 bytes? 4. How many levels would the resulting tree have without key compression, but with all pages 70 percent full? Answer 9.8 Answer omitted.
Exercise 9.9 The algorithms for insertion and deletion into a B+ tree are presented as recursive algorithms. In the code for insert, for instance, there is a call made at the parent of a node N to insert into (the subtree rooted at) node N, and when this call returns, the current node is the parent of N. Thus, we do not maintain any ‘parent pointers’ in nodes of B+ tree. Such pointers are not part of the B+ tree structure for a good reason, as this exercise will demonstrate. An alternative approach that uses parent pointers—again, remember that such pointers are not part of the standard B+ tree structure!—in each node appears to be simpler: Search to the appropriate leaf using the search algorithm; then insert the entry and split if necessary, with splits propagated to parents if necessary (using the parent pointers to ﬁnd the parents). Consider this (unsatisfactory) alternative approach: 1. Suppose that an internal node N is split into nodes N and N2. What can you say about the parent pointers in the children of the original node N?
2. Suggest two ways of dealing with the inconsistent parent pointers in the children of node N.
3. For each of the above suggestions, identify a potential (major) disadvantage.
4. What conclusions can you draw from this exercise?

Answer 9.9
1. The parent pointers in either d or d + 1 of the children of the original node N are no longer valid: they still point to N, but they should point to N2.
2. One solution is to adjust all parent pointers in the children of the original node N that became children of N2. Another solution is to leave the pointers alone during the insert operation and to adjust them later.
3. The first solution requires at least d + 1 additional page reads (and, sometime later, page writes) on an insert, which would result in a considerable slowdown. In the second solution, a child M whose parent pointer needs adjustment is updated only when some operation actually reads M into memory (perhaps on a downward path from the root to a leaf page). But this solution modifies M and therefore requires a later write of M, which might not have been necessary if there were no parent pointers.
4. In conclusion, adding parent pointers to the B+ tree data structure is not a good modification. Parent pointers cause unnecessary page updates and so lead to a decrease in performance.

Exercise 9.10 Consider the instance of the Students relation shown in Figure 9.12. Show a B+ tree of order 2 in each of these cases, assuming that duplicates are handled using overflow pages. Clearly indicate what the data entries are (i.e., do not use the 'k∗' convention).
1. A dense B+ tree index on age using Alternative (1) for data entries.
2. A sparse B+ tree index on age using Alternative (1) for data entries.
3. A dense B+ tree index on gpa using Alternative (2) for data entries.
For the purposes of this question, assume that these tuples are stored in a sorted file in the order shown in the figure: the first tuple is in page 1, slot 1; the second tuple is in page 1, slot 2; and so on. Each page can store up to three data records. You can use ⟨page-id, slot⟩ to identify a tuple.

Answer 9.10 Answer omitted.

Exercise 9.11 Suppose that duplicates are handled using the approach without overflow pages discussed in Section 9.7. Describe an algorithm to search for the left-most occurrence of a data entry with search key value K.
sid     name     login              age   gpa
53831   Madayan  madayan@music      11    1.8
53832   Guldu    guldu@music        12    3.8
53666   Jones    jones@cs           18    3.4
53901   Jones    jones@toy          18    3.4
53902   Jones    jones@physics      18    3.4
53903   Jones    jones@english      18    3.4
53904   Jones    jones@genetics     18    3.4
53905   Jones    jones@astro        18    3.4
53906   Jones    jones@chem         18    3.4
53902   Jones    jones@sanitation   18    3.8
53688   Smith    smith@ee           19    3.2
53650   Smith    smith@math         19    3.8
54001   Smith    smith@ee           19    3.5
54005   Smith    smith@cs           19    3.8
54009   Smith    smith@astro        19    2.2

Figure 9.12   An Instance of the Students Relation
Answer 9.11 The key to understanding this problem is to observe that when a leaf splits due to inserted duplicates, then of the two resulting leaves, it may happen that the left leaf contains other search key values less than the duplicated search key value. Furthermore, the least element on the right leaf could be the duplicated value. (This scenario could arise, for example, when the majority of data entries on the original leaf were for search keys of the duplicated value.) The parent index node (assuming the tree is of at least height 2) will have an entry for the duplicated value with a pointer to the right-most leaf. If this leaf continues to be filled with entries having the same duplicated key value, it could split again, causing another entry with the same key value to be inserted in the parent node. Thus, the same key value could appear many times in the index nodes as well. While searching for entries with a given key value, the search should proceed by using the left-most of the entries on an index page such that the key value is less than or equal to the given key value. Moreover, on reaching the leaf level, it is possible that there are entries with the given key value (call it k) on the page to the left of the current leaf page, unless some entry with a smaller key value is present on this leaf page. Thus, we must scan to the left using the neighbor pointers at the leaf level until we find an entry with a key value less than k (or come to the beginning of the leaf pages). Then, we must scan forward along the leaf level until we find an entry with a key value greater than k.
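The endpoint of the leftward scan described in Answer 9.11 can be illustrated by flattening the leaf level into one sorted list of (key, rid) data entries, which is a hypothetical model, not the tree itself. On such a list, `bisect_left` lands exactly on the left-most entry with key K, the same position the sibling-pointer scan reaches.

```python
# Left-most occurrence of key K in a flat, sorted leaf level.
from bisect import bisect_left

def leftmost_occurrence(leaf_entries, K):
    """leaf_entries: (key, rid) pairs sorted by key; returns the index of
    the left-most entry with key K, or None if K is absent."""
    keys = [key for key, _ in leaf_entries]
    i = bisect_left(keys, K)             # first position with keys[i] >= K
    if i < len(keys) and keys[i] == K:
        return i
    return None
```

For example, with entries keyed 1, 5, 5, 5, 9, searching for 5 returns position 1, the first of the duplicates, and the forward scan for all matches would stop at key 9.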
Exercise 9.12 Answer Exercise 9.10 assuming that duplicates are handled without using overﬂow pages, using the alternative approach suggested in Section 9.7. Answer 9.12 Answer omitted.
10
HASH-BASED INDEXING
Exercise 10.1 Consider the Extendible Hashing index shown in Figure 10.1. Answer the following questions about this index: 1. What can you say about the last entry that was inserted into the index? 2. What can you say about the last entry that was inserted into the index if you know that there have been no deletions from this index so far? 3. Suppose you are told that there have been no deletions from this index so far. What can you say about the last entry whose insertion into the index caused a split?
Directory (global depth 3): 000, 001, 010, 011, 100, 101, 110, 111

Bucket A  (local depth 3): 64, 16
Bucket B  (local depth 2): 1, 5, 21
Bucket C  (local depth 2): 10
Bucket D  (local depth 2): 15, 7, 51
Bucket A2 (local depth 3): 4, 12, 20, 36

Figure 10.1   Figure for Exercise 10.1
4. Show the index after inserting an entry with hash value 68.
5. Show the original index after inserting entries with hash values 17 and 69.
6. Show the original index after deleting the entry with hash value 21. (Assume that the full deletion algorithm is used.)
7. Show the original index after deleting the entry with hash value 10. Is a merge triggered by this deletion? If not, explain why. (Assume that the full deletion algorithm is used.)

Answer 10.1
1. It could be any one of the data entries in the index. We can always find a sequence of insertions and deletions whose last operation inserts any particular key value shown in the index. For example, consider the data entry 16 and the following sequence: insert 1, 5, 21, 10, 15, 7, 51, 4, 12, 36, 64, 8, 24, 56, 16, then delete 56, 24, 8. The last insertion is the data entry 16, and it also causes a split. But the sequence of deletions following this insertion causes a merge, leading to the index structure shown in Figure 10.1.
2. The last insertion could not have caused a split, because the total number of data entries in buckets A and A2 is 6. If the last entry had caused a split, the total would have been 5.
3. The last insertion that caused a split cannot be in bucket C. Buckets B and C, or C and D, could have made a possible bucket/split-image combination, but the total number of data entries in these combinations is 4, and the absence of deletions demands a sum of at least 5 data entries for such combinations. Buckets B and D can form a possible bucket/split-image combination because they have a total of 6 data entries between them; so do A and A2. But for B and D to be split images, the starting global depth would have to have been 1. If the starting global depth is 2, then the last insertion causing a split would be in A or A2.
4. See Figure 10.2.
5. See Figure 10.3.
6. See Figure 10.4.
7. The deletion of the data entry 10, which is the only data entry in bucket C, does not trigger a merge, because bucket C is a primary page and is left as a placeholder.
Right now, directory entry 010 and its split image 110 already point to the same bucket C, so no further merge is possible. See Fig 10.5.

Exercise 10.2 Consider the Linear Hashing index shown in Figure 10.6. Assume that we split whenever an overflow page is created. Answer the following questions about this index:
Hash-Based Indexing
Figure 10.2: Global depth 4, directory entries 0000-1111. Bucket A (local depth 3): 64, 16; bucket B (2): 1, 5, 21; bucket C (2): 10; bucket D (2): 15, 7, 51; bucket A2 (4): 4, 20, 36, 68; bucket A3 (4): 12.
Figure 10.3: Global depth 3, directory entries 000-111. Bucket A (local depth 3): 64, 16; bucket B (3): 1, 17; bucket C (2): 10; bucket D (2): 15, 7, 51; bucket A2 (3): 4, 12, 20, 36; bucket B2 (3): 5, 21, 69.
Figure 10.4: Global depth 3, directory entries 000-111. Bucket A (local depth 3): 64, 16; bucket B (2): 1, 5; bucket C (2): 10; bucket D (2): 15, 7, 51; bucket A2 (3): 4, 12, 20, 36.
Figure 10.5: Global depth 3, directory entries 000-111. Bucket A (local depth 3): 64, 16; bucket B (2): 1, 5, 21; bucket C (2): empty; bucket D (2): 15, 7, 51; bucket A2 (3): 4, 12, 20, 36.
Figure 10.6 (figure for Exercise 10.2): Linear Hashing index, Level = 0, Next = 1; hash functions h(1)/h(0). Primary pages: bucket 000/00: 32, 8, 24; bucket 001/01: 9, 25, 41, 17; bucket 010/10: 14, 18, 10, 30; bucket 011/11: 31, 35, 7, 11; bucket 100/00: 44, 36.
1. What can you say about the last entry that was inserted into the index?
2. What can you say about the last entry that was inserted into the index if you know that there have been no deletions from this index so far?
3. Suppose you know that there have been no deletions from this index so far. What can you say about the last entry whose insertion into the index caused a split?
4. Show the index after inserting an entry with hash value 4.
5. Show the original index after inserting an entry with hash value 15.
6. Show the original index after deleting the entries with hash values 36 and 44. (Assume that the full deletion algorithm is used.)
7. Find a list of entries whose insertion into the original index would lead to a bucket with two overflow pages. Use as few entries as possible to accomplish this. What is the maximum number of entries that can be inserted into this bucket before a split occurs that reduces the length of this overflow chain?

Answer 10.2 Answer omitted.
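Several of the Linear Hashing questions in this chapter turn on how a key is mapped to a bucket once some buckets in the current round have been split. A minimal sketch of the standard addressing rule (the function and parameter names are my own, not the book's):

```python
# Linear Hashing bucket addressing: use h_Level unless the bucket it names
# has already been split in this round, in which case h_{Level+1} applies.
# n0 is the number of primary buckets at the start of round 0.
def bucket(key, level, next_, n0=4):
    b = key % (n0 * 2 ** level)            # h_level: low-order bits of the key
    if b < next_:                          # bucket b was split earlier this round
        b = key % (n0 * 2 ** (level + 1))  # h_{level+1} disambiguates
    return b

# With Level = 0 and Next = 1, as in Figure 10.6:
print(bucket(18, 0, 1))   # 18 mod 4 = 2, bucket 2 not yet split
print(bucket(44, 0, 1))   # 44 mod 4 = 0 < Next, so 44 mod 8 = 4
```

This is consistent with Figure 10.6, where 44 sits in bucket 100 while 18 remains in bucket 010.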
Exercise 10.3 Answer the following questions about Extendible Hashing: 1. Explain why local depth and global depth are needed.
2. After an insertion that causes the directory size to double, how many buckets have exactly one directory entry pointing to them? If an entry is then deleted from one of these buckets, what happens to the directory size? Explain your answers briefly.
3. Does Extendible Hashing guarantee at most one disk access to retrieve a record with a given key value?
4. If the hash function distributes data entries over the space of bucket numbers in a very skewed (non-uniform) way, what can you say about the size of the directory? What can you say about the space utilization in data pages (i.e., non-directory pages)?
5. Does doubling the directory require us to examine all buckets with local depth equal to global depth?
6. Why is handling duplicate key values in Extendible Hashing harder than in ISAM?

Answer 10.3

1. Extendible hashing allows the directory to grow and shrink. When the directory size changes, the hash function applied to search key values changes as well, so the index must record which hash function to apply; this information is provided by the global depth. An increase in directory size does not create a new bucket for each new directory entry: all the new directory entries except one share buckets with the old entries. Consequently, when a bucket shared by two or more directory entries is split, the directory need not double. We therefore need a quick way to tell, for each bucket, whether it is shared by two or more directory entries; this information is provided by the bucket's local depth. The same information could be obtained by scanning the directory, but that is obviously more expensive.
2. Exactly two buckets have only one directory entry pointing to them after a doubling of the directory. Whether a merge occurs on the subsequent deletion depends on the deletion algorithm.
If we merge two buckets only when one of them becomes empty, then the directory size need not decrease after the deletion considered in the question. But if we merge two buckets whenever it is possible to do so, the directory size does decrease after the deletion.
3. No such guarantee is provided by Extendible Hashing. If the directory is not already in memory, it must be fetched from disk, which may require more than one disk access depending on the size of the directory. Then the required bucket has to be brought into memory. Also, if Alternative (2) or (3) is used for the data entries in the index, another disk access may be needed to fetch the actual data record.
4. Consider the index in Fig 10.1. Take a list of data entries with search key values of the form 2^i, where i > k. By an appropriate choice of k, all of these entries map into bucket A. The directory then has 2^k entries, but they point to only k + 3 different buckets; of the k data pages created along the way, just one is actually used. So the utilization of data pages is 1/k.
5. No. We need to examine only one bucket (the one being split).
6. Extendible hashing is not supposed to have overflow pages. Even if overflow pages are allowed, whenever we create one we also split the bucket, producing an empty bucket; and when all the entries in the bucket being split are duplicates, redistribution accomplishes nothing. It therefore costs more to handle such insertions in Extendible Hashing than in ISAM. In both structures, the number of accesses needed to retrieve a record increases because of the overflow chains.
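The skew argument in part 4 can be checked concretely: if the hash simply takes the last d bits of the key, then every key of the form 2^i with i >= d has the same last d bits (all zero), so every entry lands in a single bucket. A small illustration (d = 3 is an assumed global depth):

```python
d = 3                                     # assumed global depth: last 3 bits
keys = [2 ** i for i in range(d, d + 8)]  # search keys of the form 2^i, i >= d
buckets = {k % (2 ** d) for k in keys}    # directory slot = last d bits of key
print(buckets)                            # every key collapses into bucket 0
```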
Exercise 10.4 Answer the following questions about Linear Hashing:
1. How does Linear Hashing provide an average-case search cost of only slightly more than one disk I/O, given that overflow buckets are part of its data structure?
2. Does Linear Hashing guarantee at most one disk access to retrieve a record with a given key value?
3. If a Linear Hashing index using Alternative (1) for data entries contains N records, with P records per page and an average storage utilization of 80 percent, what is the worst-case cost for an equality search? Under what conditions would this cost be the actual search cost?
4. If the hash function distributes data entries over the space of bucket numbers in a very skewed (non-uniform) way, what can you say about the space utilization in data pages?

Answer 10.4 Answer omitted.
Exercise 10.5 Give an example of when you would use each element (A or B) for each of the following 'A versus B' pairs:
1. A hashed index using Alternative (1) versus heap file organization.
2. Extendible Hashing versus Linear Hashing.
3. Static Hashing versus Linear Hashing.
4. Static Hashing versus ISAM.
5. Linear Hashing versus B+ trees.
Answer 10.5

1. Example 1: Consider a situation in which most of the queries are equality queries on the search key field. It pays to build a hashed index on this field, in which case we can get the required record in one or two disk accesses. A heap file organisation may require a full scan of the file to access a particular record.
Example 2: A file on which only sequential scans are done may fare better if it is organised as a heap file. A hashed index built on it may require more disk accesses because page occupancy may be below 100%.
2. Example 1: A set of data entries with search keys that lead to a skewed distribution of hash key values. In this case, extendible hashing splits exactly the bucket that needs splitting, whereas linear hashing splits buckets in a round-robin fashion, which is useless here. Extendible hashing therefore has better occupancy and shorter overflow chains than linear hashing, so equality search is cheaper.
Example 2: A very large file whose directory spans several pages. In this case extendible hashing requires d + 1 disk accesses for an equality selection, where d is the number of directory pages. Linear hashing is cheaper.
3. Example 1: Consider a situation in which the number of records in the file is constant and all search key values are of the form 2^n + k, for various values of n and a few values of k. The hash functions traditionally used in linear hashing, such as taking the last d bits of the search key, lead to a skewed distribution of hash key values and hence to long overflow chains. A static hashing index can instead use the hash function h(2^n + k) = n. A family of hash functions cannot be built from this function, since k takes only a few values, so it is unusable for linear hashing.
Example 2: Consider a situation in which the number of records in the file varies widely and the hash key values have a uniform distribution.
Here linear hashing is clearly better than static hashing, which could develop long overflow chains that considerably increase the cost of equality search.
4. Example 1: A situation in which the number of records in the file is constant and only equality selections are performed. Static hashing requires one or two disk accesses to reach the data entry; ISAM may require more, depending on the height of the ISAM tree.
Example 2: Consider a situation in which the search key values of data entries can be used to build a clustered index and most of the queries are range queries on this field. Then ISAM definitely wins over static hashing.
5. Example 1: Again, a situation in which only equality selections are performed on the index; linear hashing is then better than a B+ tree.
Example 2: An index that is clustered, where most of the queries are range searches; the B+ tree is then the better choice.
Exercise 10.6 Give examples of the following:
1. A Linear Hashing index and an Extendible Hashing index with the same data entries, such that the Linear Hashing index has more pages.
2. A Linear Hashing index and an Extendible Hashing index with the same data entries, such that the Extendible Hashing index has more pages.

Answer 10.6 Answer omitted.
Exercise 10.7 Consider a relation R(a, b, c, d) containing 1,000,000 records, where each page of the relation holds 10 records. R is organized as a heap file with dense secondary indexes, and the records in R are randomly ordered. Assume that attribute a is a candidate key for R, with values lying in the range 0 to 999,999. For each of the following queries, name the approach that would most likely require the fewest I/Os for processing the query. The approaches to consider follow:

Scanning through the whole heap file for R.
Using a B+ tree index on attribute R.a.
Using a hash index on attribute R.a.

The queries are:

1. Find all R tuples.
2. Find all R tuples such that a < 50.
3. Find all R tuples such that a = 50.
4. Find all R tuples such that a > 50 and a < 100.

Answer 10.7 Let h be the height of the B+ tree (usually 2 or 3) and M be the number of data entries per page (M > 10). Assume that after accessing the data entry it takes one more disk access to get the actual record, and let c be the occupancy factor in hash indexing. Consider the table shown below:

Query              Heap File   B+ Tree             Hash Index
1. all tuples      10^5        h + 10^6/M + 10^6   10^6/(cM) + 10^6
2. a < 50          10^5        h + 50/M + 50       100
3. a = 50          10^5        h + 1               2
4. 50 < a < 100    10^5        h + 49/M + 49       98
Figure 10.7 (figure for Exercise 10.9): Linear Hashing index, Level = 0, N = 4, Next = 0; hash functions h1/h0. Primary pages: bucket 000/00: 64, 44; bucket 001/01: 9, 25, 5; bucket 010/10: 10; bucket 011/11: 31, 15, 7, 3.
1. From the first row of the table, the heap file organisation is best.
2. From the second row, with typical values for h and M, the B+ tree is best.
3. From the third row, hash indexing is best.
4. From the fourth row, again the B+ tree is best.
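The table can also be checked mechanically. The sketch below plugs illustrative values into the symbolic parameters (h = 3, M = 100, c = 0.8 are assumptions, not values fixed by the exercise) and picks the cheapest approach for each query:

```python
from math import ceil

h, M, c = 3, 100, 0.8      # assumed tree height, entries/page, hash occupancy
costs = {
    "all tuples":   {"heap": 10**5,
                     "btree": h + ceil(10**6 / M) + 10**6,
                     "hash":  ceil(10**6 / (c * M)) + 10**6},
    "a < 50":       {"heap": 10**5,
                     "btree": h + ceil(50 / M) + 50,
                     "hash":  2 * 50},     # one equality probe (2 I/Os) per value
    "a = 50":       {"heap": 10**5, "btree": h + 1, "hash": 2},
    "50 < a < 100": {"heap": 10**5,
                     "btree": h + ceil(49 / M) + 49,
                     "hash":  2 * 49},
}
for query, row in costs.items():
    print(query, "->", min(row, key=row.get))
```

With these values the winners match the per-row conclusions above: heap file, B+ tree, hash index, B+ tree.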
Exercise 10.8 How would your answers to Exercise 10.7 change if attribute a is not a candidate key for R? How would they change if we assume that records in R are sorted on a? Answer 10.8 Answer omitted.
Exercise 10.9 Consider the snapshot of the Linear Hashing index shown in Figure 10.7. Assume that a bucket split occurs whenever an overflow page is created.
1. What is the maximum number of data entries that can be inserted (given the best possible distribution of keys) before you have to split a bucket? Explain very briefly.
2. Show the file after inserting a single record whose insertion causes a bucket split.
3. (a) What is the minimum number of record insertions that will cause a split of all four buckets? Explain very briefly.
(b) What is the value of Next after making these insertions?
(c) What can you say about the number of pages in the fourth bucket shown after this series of record insertions?
Figure 10.8: Level = 1, N = 4, Next = 1. Bucket 000/00: 64; bucket 001/01: 9, 25, 5; bucket 010/10: 10; bucket 011/11: 31, 15, 7, 3, with 63 on an overflow page; bucket 100/00: 44.
Answer 10.9

1. The maximum number of entries that can be inserted without causing a split is 6, because there is free space for a total of 6 records across the pages. A split is caused whenever an entry is inserted into a full page.
2. See Fig 10.8.
3. (a) Consider the insertions 63, 41, 73, 137, followed by 4 more entries that go into the same bucket, say 18, 34, 66, 130, which go into the third bucket. The insertion of 63 causes the first bucket split. Inserting 41 and 73 causes the second bucket split, leaving a full second bucket; inserting 137 into it causes the third bucket split. At this point at least 4 more entries are required to split the fourth bucket, so a minimum of 8 entries is required to cause the 4 splits.
(b) Since all four buckets have been split, that round comes to an end and the next round begins, so Next = 0 again.
(c) There can be either one or two data pages in the fourth bucket after these insertions. If the 4 additional entries are inserted into the second bucket after the third bucket split, the fourth bucket has 1 data page. If the 4 additional entries are inserted into the fourth bucket after the fourth bucket split, and all of them have 011 as their last three bits, the fourth bucket has 2 data pages; otherwise, if not all of them have 011 as their last three bits, the fourth bucket has 1 data page.
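The split-on-overflow behavior traced through this answer can be simulated. The sketch below is a toy Linear Hashing model of my own (not the book's code); pages hold 4 entries, matching Figure 10.7, and bucket Next is split whenever an insertion starts a new overflow page:

```python
class LinearHash:
    """Toy Linear Hashing index. Buckets are flat lists; a bucket with more
    than `cap` entries implicitly owns overflow pages. Bucket Next is split
    whenever an insertion starts a new overflow page anywhere in the file."""

    def __init__(self, buckets, cap=4):
        self.buckets = [list(b) for b in buckets]
        self.cap = cap
        self.n0 = len(buckets)       # primary buckets at the start of round 0
        self.level = 0
        self.next = 0

    def _addr(self, key, level):
        return key % (self.n0 * 2 ** level)

    def insert(self, key):
        b = self._addr(key, self.level)
        if b < self.next:                         # already split this round
            b = self._addr(key, self.level + 1)
        self.buckets[b].append(key)
        n = len(self.buckets[b])
        if n > self.cap and n % self.cap == 1:    # a new overflow page started
            self._split()

    def _split(self):
        round_size = self.n0 * 2 ** self.level
        victim = self.buckets[self.next]
        self.buckets[self.next] = [
            k for k in victim if self._addr(k, self.level + 1) == self.next]
        self.buckets.append(                      # split image of bucket Next
            [k for k in victim if self._addr(k, self.level + 1) != self.next])
        self.next += 1
        if self.next == round_size:               # round over: advance Level
            self.level += 1
            self.next = 0

# The Figure 10.7 snapshot: four buckets, pages of capacity 4.
lh = LinearHash([[64, 44], [9, 25, 5], [10], [31, 15, 7, 3]])
lh.insert(63)   # starts an overflow page of bucket 11, so bucket 00 is split
print(lh.next, lh.buckets)   # 44 moves to the new bucket 100, as in Fig 10.8
```

Continuing with the eight insertions from part 3(a) drives the simulator through all four splits and back to Next = 0, confirming the count given above.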
Exercise 10.10 Consider the data entries in the Linear Hashing index for Exercise 10.9.
1. Show an Extendible Hashing index with the same data entries.
2. Answer the questions in Exercise 10.9 with respect to this index.

Answer 10.10 Answer omitted.
Exercise 10.11 In answering the following questions, assume that the full deletion algorithm is used. Assume that merging is done when a bucket becomes empty.
1. Give an example of an Extendible Hashing index in which deleting an entry reduces the global depth.
2. Give an example of a Linear Hashing index in which deleting an entry causes Next to be decremented but leaves Level unchanged. Show the file before and after the entry is deleted.
3. Give an example of a Linear Hashing index in which deleting an entry causes Level to be decremented. Show the file before and after the entry is deleted.
4. Give an example of an Extendible Hashing index and a list of entries e1, e2, e3 such that inserting the entries in order leads to three splits and deleting them in the reverse order yields the original index. If such an example does not exist, explain.
5. Give an example of a Linear Hashing index and a list of entries e1, e2, e3 such that inserting the entries in order leads to three splits and deleting them in the reverse order yields the original index. If such an example does not exist, explain.

Answer 10.11 The answers are as follows.
1. See Fig 10.9.
2. See Fig 10.10.
Figure 10.9: Before the deletion, global depth 3, directory entries 000-111; bucket A (local depth 2): 64, 44; bucket B (2): 9, 25, 5; bucket C (2): 10; bucket D (3): 3, 11, 27, 19; bucket D2 (3): 63. After deleting 63, global depth 2, directory entries 00-11; bucket A (2): 64, 44; bucket B (2): 9, 25, 5; bucket C (2): 10; bucket D (2): 3, 11, 27, 19.
3. See Fig 10.11.
4. Consider the transition shown in Fig 10.12, in which we insert the data entries 4, 5, and 7. Each of these insertions causes a split, with the first also causing a directory split; but none of them redistributes any of the pre-existing data entries into the new buckets. So when we delete these data entries in the reverse order (actually, the order does not matter here) and follow the full deletion algorithm, we get back the original index.
5. This example is shown in Fig 10.13. The idea is similar to the previous answer, except that the bucket being split must be the one into which the insertion is made; thus bucket 2 has to be split, not bucket 3. Also, the order of deletions must be exactly reversed, because in the deletion algorithm Next is decremented only when the last bucket becomes empty.
Figure 10.10: Before the deletion, Level = 1, N = 4, Next = 1; bucket 000/00: 64; bucket 001/01: 9, 25, 5; bucket 010/10: 10; bucket 011/11: 31, 15, 7, 3, with 63 on an overflow page; bucket 100/00: 44. After deleting 44, Level = 1, N = 4, Next = 0; bucket 00: 64; bucket 01: 9, 25, 5; bucket 10: 10; bucket 11: 31, 15, 7, 3, with 63 on an overflow page.
Figure 10.11: Before the deletion, Level = 2, N = 4, Next = 0; hash functions h2/h1; bucket 000/00: 64, 4, 8, 16; bucket 001/01: 9, 25, 5, 19; bucket 010/10: 10; bucket 011/11: 31. Deleting 31 empties the last bucket and Level is decremented: Level = 1, N = 2, Next = 1; hash functions h1/h0; bucket 00/0: 64, 4, 8, 16; bucket 01/1: 9, 25, 5, 19; bucket 10: 10.
Figure 10.12: Before the insertions, global depth 2, directory entries 00-11; bucket A (local depth 2): 64, 32, 8, 16; bucket B (2): 9, 25, 41, 73; bucket C (2): 10; bucket D (2): 11, 19, 35, 3. After inserting 4, 5, 7, global depth 3, directory entries 000-111; bucket A (3): 64, 32, 8, 16; bucket B (3): 9, 25, 41, 73; bucket C (2): 10; bucket D (3): 11, 19, 35, 3; plus new buckets (3): 4; (3): 5; (3): 7.
Figure 10.13: Before the insertions, Next = 0; hash functions h1/h0; bucket 000/00 (A): 64, 32, 8, 16; bucket 001/01 (B): 9, 25, 41, 73; bucket 010/10 (C): 10, 18, 34, 66; bucket 011/11 (D): 11, 19. After inserting 4, 5, 6, Next = 3; the original buckets are unchanged, and new buckets 100/00: 4; 101/01: 5; 110/10: 6 have been added.
11
EXTERNAL SORTING
Exercise 11.1 Suppose that you have a file with 10,000 pages and that you have three buffer pages. Answer the following questions for each of these scenarios, assuming that our most general external sorting algorithm is used:
(a) A file with 10,000 pages and three available buffer pages.
(b) A file with 20,000 pages and five available buffer pages.
(c) A file with 2,000,000 pages and 17 available buffer pages.
1. How many runs will you produce in the first pass?
2. How many passes will it take to sort the file completely?
3. What is the total I/O cost of sorting the file?
4. How many buffer pages do you need to sort the file completely in just two passes?

Answer 11.1

1. In the first pass (Pass 0), ceil(N/B) runs of B pages each are produced, where N is the number of file pages and B is the number of available buffer pages:
(a) ceil(10000/3) = 3334 sorted runs.
(b) ceil(20000/5) = 4000 sorted runs.
(c) ceil(2000000/17) = 117,648 sorted runs.
2. The number of passes required to sort the file completely, including the initial sorting pass, is ceil(log_{B-1} N1) + 1, where N1 = ceil(N/B) is the number of runs produced by Pass 0:
(a) ceil(log_2 3334) + 1 = 13 passes.
(b) ceil(log_4 4000) + 1 = 7 passes.
(c) ceil(log_16 117648) + 1 = 6 passes.
Chapter 11
3. Since each page is read and written once per pass, the total number of page I/Os for sorting the file is 2 * N * (#passes):
(a) 2 * 10000 * 13 = 260,000.
(b) 2 * 20000 * 7 = 280,000.
(c) 2 * 2000000 * 6 = 24,000,000.
4. In Pass 0, ceil(N/B) runs are produced. In Pass 1, we must be able to merge this many runs; i.e., B - 1 >= ceil(N/B). This implies that B must at least be large enough to satisfy B * (B - 1) >= N; this can be used to guess at B, and the guess must be validated by checking the first inequality. Thus:
(a) With 10,000 pages in the file, B = 101 satisfies both inequalities and B = 100 does not, so we need 101 buffer pages.
(b) With 20,000 pages in the file, B = 142 satisfies both inequalities and B = 141 does not, so we need 142 buffer pages.
(c) With 2,000,000 pages in the file, B = 1415 satisfies both inequalities and B = 1414 does not, so we need 1415 buffer pages.

Exercise 11.2 Answer Exercise 11.1 assuming that a two-way external sort is used.

Answer 11.2 Answer omitted.

Exercise 11.3 Suppose that you just finished inserting several records into a heap file, and now you want to sort those records. Assume that the DBMS uses external sort and makes efficient use of the available buffer space when it sorts a file. Here is some potentially useful information about the newly loaded file and the DBMS software available to operate on it:

The number of records in the file is 4,500. The sort key for the file is four bytes long. You can assume that rids are eight bytes long and page ids are four bytes long. Each record is a total of 48 bytes long. The page size is 512 bytes. Each page has 12 bytes of control information on it. Four buffer pages are available.

1. How many sorted subfiles will there be after the initial pass of the sort, and how long will each subfile be?
2. How many passes (including the initial pass considered above) will be required to sort this file?
3. What will be the total I/O cost for sorting this file?
4. What is the largest file, in terms of the number of records, that you can sort with just four buffer pages in two passes? How would your answer change if you had 257 buffer pages?
5. Suppose that you have a B+ tree index with the search key being the same as the desired sort key. Find the cost of using the index to retrieve the records in sorted order for each of the following cases: The index uses Alternative (1) for data entries. The index uses Alternative (2) and is not clustered. (You can compute the worst-case cost in this case.) How would the costs of using the index change if the ﬁle is the largest that you can sort in two passes of external sort with 257 buﬀer pages? Give your answer for both clustered and unclustered indexes. Answer 11.3 1. Assuming that the general external merge-sort algorithm is used, and that the available space for storing records in each page is 512 − 12 = 500 bytes, each page can store up to 10 records of 48 bytes each. So 450 pages are needed in order to store all 4500 records, assuming that a record is not allowed to span more than one page. Given that 4 buﬀer pages are available, there will be 450/4 = 113 sorted runs (sub-ﬁles) of 4 pages each, except the last run, which is only 2 pages long. 2. The total number of passes will be equal to log3 113 + 1 = 6 passes. 3. The total I/O cost for sorting this ﬁle is 2 ∗ 450 ∗ 6 = 5400 I/Os. 4. As we saw in the previous exercise, in Pass 0, N/B runs are produced. In Pass 1, we must be able to merge this many runs; i.e., B − 1 ≥ N/B . When B is given to be 4, we get N = 12. The maximum number of records on 12 pages is 12 ∗ 10 = 120. When B = 257, we get N = 65792, and the number of records is 65792 ∗ 10 = 657920. 5. (a) If the index uses Alternative (1) for data entries, and it is clustered, the cost will be equal to the cost of traversing the tree from the root to the leftmost leaf plus the cost of retrieving the pages in the sequence set. Assuming 67% occupancy, the number of leaf pages in the tree (the sequence set) is 450/0.67 = 600. 
(b) If the index uses Alternative (2) and is not clustered, then in the worst case we first scan the B+ tree's leaf pages, and in addition each data entry requires fetching a data page. The number of data entries equals the number of data records, 4,500. Since each data entry requires 12 bytes and each page holds 512 bytes, the number of B+ tree leaf pages is about (4500 * 12)/(512 * 0.67), assuming 67% occupancy, which is about 150. Thus, about 4,650 I/Os are required in a worst-case scenario.
(c) With Alternative (1) and 67% occupancy, the B+ tree has 65792/0.67, or about 98,197 leaf pages. This is the number of I/Os required (plus the relatively minor cost of going from the root to the leftmost leaf).
If Alternative (2) is used and the index is not clustered, the number of I/Os is approximately the number of data entries in the worst case, that is, 657,920, plus the number of B+ tree leaf pages, about (657920 * 12)/(512 * 0.67), or roughly 23,000. Thus, the number of I/Os is about 681,000.
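The run, pass, and I/O formulas used in Answers 11.1 and 11.3 are easy to check with a short script (the function names are my own):

```python
from math import ceil, log

def external_sort_cost(N, B):
    """Runs, passes, and total page I/Os for the general external merge
    sort: Pass 0 makes ceil(N/B) runs of B pages, then (B-1)-way merges."""
    runs = ceil(N / B)
    passes = 1 + ceil(log(runs, B - 1))
    return runs, passes, 2 * N * passes    # each pass reads and writes N pages

def min_buffers_two_pass(N):
    """Smallest B with B*(B-1) >= N, i.e., one merge pass suffices."""
    B = 2
    while B * (B - 1) < N:
        B += 1
    return B

for N, B in [(10_000, 3), (20_000, 5), (2_000_000, 17)]:
    print(N, B, external_sort_cost(N, B), min_buffers_two_pass(N))
```

For the three scenarios of Exercise 11.1 this reproduces 3334 runs / 13 passes / 260,000 I/Os with B = 101; 4000 / 7 / 280,000 with B = 142; and 117,648 / 6 / 24,000,000 with B = 1415.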
Exercise 11.4 Consider a disk with an average seek time of 10ms, average rotational delay of 5ms, and a transfer time of 1ms for a 4K page. Assume that the cost of reading/writing a page is the sum of these values (i.e., 16ms) unless a sequence of pages is read/written; in that case the cost is the average seek time plus the average rotational delay (to find the first page in the sequence) plus 1ms per page (to transfer data). You are given 320 buffer pages and asked to sort a file with 10,000,000 pages.
1. Why is it a bad idea to use the 320 pages to support virtual memory, that is, to 'new' 10,000,000*4K bytes of memory, and to use an in-memory sorting algorithm such as Quicksort?
2. Assume that you begin by creating sorted runs of 320 pages each in the first pass. Evaluate the cost of the following approaches for the subsequent merging passes:
(a) Do 319-way merges.
(b) Create 256 'input' buffers of 1 page each, create an 'output' buffer of 64 pages, and do 256-way merges.
(c) Create 16 'input' buffers of 16 pages each, create an 'output' buffer of 64 pages, and do 16-way merges.
(d) Create eight 'input' buffers of 32 pages each, create an 'output' buffer of 64 pages, and do eight-way merges.
(e) Create four 'input' buffers of 64 pages each, create an 'output' buffer of 64 pages, and do four-way merges.

Answer 11.4 Answer omitted.
Exercise 11.5 Consider the reﬁnement to the external sort algorithm that produces runs of length 2B on average, where B is the number of buﬀer pages. This reﬁnement was described in Section 11.2.1 under the assumption that all records are the same size. Explain why this assumption is required and extend the idea to cover the case of variable length records. Answer 11.5 Answer not available.
12
EVALUATION OF RELATIONAL OPERATORS
Exercise 12.1 Briefly answer the following questions:
1. Consider the three basic techniques, iteration, indexing, and partitioning, and the relational algebra operators selection, projection, and join. For each technique-operator pair, describe an algorithm based on the technique for evaluating the operator.
2. Define the term most selective access path for a query.
3. Describe conjunctive normal form, and explain why it is important in the context of relational query evaluation.
4. When does a general selection condition match an index? What is a primary term in a selection condition with respect to a given index?
5. How does hybrid hash join improve upon the basic hash join algorithm?
6. Discuss the pros and cons of hash join, sort-merge join, and block nested loops join.
7. If the join condition is not equality, can you use sort-merge join? Can you use hash join? Can you use index nested loops join? Can you use block nested loops join?
8. Describe how to evaluate a grouping query with aggregation operator MAX using a sorting-based approach.
9. Suppose that you are building a DBMS and want to add a new aggregate operator called SECOND LARGEST, which is a variation of the MAX operator. Describe how you would implement it.
10. Give an example of how buffer replacement policies can affect the performance of a join algorithm.
Answer 12.1 Answer not available.
Chapter 12
Exercise 12.2 Consider a relation R(a,b,c,d,e) containing 5,000,000 records, where each data page of the relation holds 10 records. R is organized as a sorted file with dense secondary indexes. Assume that R.a is a candidate key for R, with values lying in the range 0 to 4,999,999, and that R is stored in R.a order. For each of the following relational algebra queries, state which of the following three approaches is most likely to be the cheapest:

Access the sorted file for R directly.
Use a (clustered) B+ tree index on attribute R.a.
Use a linear hashed index on attribute R.a.

1. σa<50,000(R)
2. σa=50,000(R)
3. σa>50,000∧a<50,010(R)
4. σa≠50,000(R)

Answer 12.2 Answer omitted.
Exercise 12.3 Consider processing the following SQL projection query:

SELECT DISTINCT E.title, E.ename FROM Executives E

You are given the following information:

Executives has attributes ename, title, dname, and address; all are string fields of the same length. The ename attribute is a candidate key. The relation contains 10,000 pages. There are 10 buffer pages.

Consider the optimized version of the sorting-based projection algorithm: the initial sorting pass reads the input relation and creates sorted runs of tuples containing only attributes ename and title; subsequent merging passes eliminate duplicates while merging the initial runs to obtain a single sorted result (as opposed to doing a separate pass to eliminate duplicates from a sorted result containing duplicates).
Evaluation of Relational Operators
1. How many sorted runs are produced in the first pass? What is the average length of these runs? (Assume that memory is utilized well and that any available optimization to increase run size is used.) What is the I/O cost of this sorting pass?
2. How many additional merge passes will be required to compute the final result of the projection query? What is the I/O cost of these additional passes?
3. (a) Suppose that a clustered B+ tree index on title is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
(b) Suppose that a clustered B+ tree index on ename is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
(c) Suppose that a clustered B+ tree index on <ename, title> is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered? Would your answer change if the index were a hash index?
4. Suppose that the query is as follows:

SELECT E.title, E.ename FROM Executives E

That is, you are not required to do duplicate elimination. How would your answers to the previous questions change?

Answer 12.3

1. The first pass produces 250 sorted runs of 20 pages each, costing 15,000 I/Os.
2. Using the ten buffer pages provided, on average we can write 2 * 10 = 20 internally sorted pages per run, instead of 10. Three more passes are then required to merge the 5000/20 = 250 runs, costing 2 * 3 * 5000 = 30,000 I/Os.
3. (a) Using a clustered B+ tree index on title would reduce the cost to a single scan, or 12,500 I/Os. An unclustered index could potentially cost more than 2500 + 100,000 I/Os (2,500 from scanning the B+ tree, plus 10,000 * the number of tuples per page, assumed here to be 10). Thus, an unclustered index would not be cheaper.
Whether or not to use a hash index depends on whether the index is clustered; if so, the hash index would probably be cheaper.
(b) Using the clustered B+ tree on ename would be cheaper than sorting: the cost of using the B+ tree would be 12,500 I/Os. Since ename is a candidate key, no duplicate checking need be done for <title, ename> pairs. An unclustered index would require 2,500 (scan of index) + 10,000 * tuples-per-page I/Os, and would thus probably be more expensive than sorting.
(c) Using a clustered B+ tree index on <ename, title> would also be more cost-effective than sorting. An unclustered B+ tree over the same attributes would allow an index-only scan, and would thus be just as economical as the clustered index. This method (with either a clustered or an unclustered index) would cost around 5,000 I/Os.
4. Knowing that duplicate elimination is not required, we can simply scan the relation and discard unwanted fields for each tuple. This is the best strategy except when an index (clustered or unclustered) on <ename, title> is available; in that case, we can do an index-only scan. (Note that even with DISTINCT specified, no duplicates are actually present in the answer because ename is a candidate key. However, a typical optimizer is not likely to recognize this and omit the duplicate-elimination step.)
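The arithmetic in parts 1 and 2 of Answer 12.3 can be sketched as follows (the factor of one-half reflects that ename and title are two of four equal-length fields):

```python
from math import ceil, log

N, B = 10_000, 10           # relation pages, buffer pages
out = N // 2                # projected tuples are half the original width
pass0_io = N + out          # read the relation, write the projected runs
runs = ceil(out / (2 * B))  # replacement-style sorting: runs of about 2B pages
merge_passes = ceil(log(runs, B - 1))          # (B-1)-way merges
merge_io = 2 * out * merge_passes
print(runs, pass0_io, merge_passes, merge_io)  # 250 runs; 15,000 + 30,000 I/Os
```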
Exercise 12.4 Consider the join R ⋈R.a=S.b S, given the following information about the relations to be joined. The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the result should be uniformly ignored.

Relation R contains 10,000 tuples and has 10 tuples per page. Relation S contains 2,000 tuples and also has 10 tuples per page. Attribute b of relation S is the primary key for S. Both relations are stored as simple heap files. Neither relation has any indexes built on it. 52 buffer pages are available.
1. What is the cost of joining R and S using a page-oriented simple nested loops join? What is the minimum number of buﬀer pages required for this cost to remain unchanged? 2. What is the cost of joining R and S using a block nested loops join? What is the minimum number of buﬀer pages required for this cost to remain unchanged? 3. What is the cost of joining R and S using a sort-merge join? What is the minimum number of buﬀer pages required for this cost to remain unchanged? 4. What is the cost of joining R and S using a hash join? What is the minimum number of buﬀer pages required for this cost to remain unchanged? 5. What would be the lowest possible I/O cost for joining R and S using any join algorithm, and how much buﬀer space would be needed to achieve this cost? Explain brieﬂy. 6. How many tuples will the join of R and S produce, at most, and how many pages would be required to store the result of the join back on disk?
Evaluation of Relational Operators
113
7. Would your answers to any of the previous questions in this exercise change if you are told that R.a is a foreign key that refers to S.b? Answer 12.4 Answer omitted.
Exercise 12.5 Consider the join of R and S described in Exercise 12.1. 1. With 52 buﬀer pages, if unclustered B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using an index nested loops join) than a block nested loops join? Explain. (a) Would your answer change if only ﬁve buﬀer pages were available? (b) Would your answer change if S contained only 10 tuples instead of 2,000 tuples? 2. With 52 buﬀer pages, if clustered B+ indexes existed on R.a and S.b, would either provide a cheaper alternative for performing the join (using the index nested loops algorithm) than a block nested loops join? Explain. (a) Would your answer change if only ﬁve buﬀer pages were available? (b) Would your answer change if S contained only 10 tuples instead of 2,000 tuples? 3. If only 15 buﬀers were available, what would be the cost of a sort-merge join? What would be the cost of a hash join? 4. If the size of S were increased to also be 10,000 tuples, but only 15 buﬀer pages were available, what would be the cost of a sort-merge join? What would be the cost of a hash join? 5. If the size of S were increased to also be 10,000 tuples, and 52 buﬀer pages were available, what would be the cost of sort-merge join? What would be the cost of hash join? Answer 12.5 Assume that it takes 3 I/Os to access a leaf in R, and 2 I/Os to access a leaf in S. And since S.b is a primary key, we will assume that every tuple in S matches 5 tuples in R. 1. The Index Nested Loops join involves probing an index on the inner relation for each tuple in the outer relation. The cost of the probe is the cost of accessing a leaf page plus the cost of retrieving any matching data records. The cost of retrieving data records could be as high as one I/O per record for an unclustered index.
With R as the outer relation, the cost of the Index Nested Loops join will be the cost of reading R plus the cost of 10,000 probes on S:

TotalCost = 1,000 + 10,000 * (2 + 1) = 31,000

With S as the outer relation, the cost of the Index Nested Loops join will be the cost of reading S plus the cost of 2,000 probes on R:

TotalCost = 200 + 2,000 * (3 + 5) = 16,200

Neither of these solutions is cheaper than the Block Nested Loops join, which required 4,200 I/Os.

(a) With 5 buffer pages, the cost of the Index Nested Loops joins remains the same, but the cost of the Block Nested Loops join will increase. The new cost of the Block Nested Loops join is

TotalCost = N + ⌈N / (B − 2)⌉ * M = 200 + ⌈200/3⌉ * 1,000 = 67,200
And now the cheapest solution is the Index Nested Loops join with S as the outer relation.

(b) If S contains 10 tuples then we'll need to change some of our initial assumptions. Now all of the S tuples fit on a single page, and it will only require a single I/O to access the (single) leaf in the index. Also, each tuple in S will match 1,000 tuples in R.

Block Nested Loops:

TotalCost = N + ⌈N / (B − 2)⌉ * M = 1 + 1 * 1,000 = 1,001
Index Nested Loops with R as the outer relation:

TotalCost = 1,000 + 10,000 * (1 + 1) = 21,000

Index Nested Loops with S as the outer relation:

TotalCost = 1 + 10 * (3 + 1,000) = 10,031

Block Nested Loops is still the best solution.

2. With a clustered index the cost of accessing data records becomes one I/O for every 10 data records. With R as the outer relation, the cost of the Index Nested Loops join will be the cost of reading R plus the cost of 10,000 probes on S:

TotalCost = 1,000 + 10,000 * (2 + 1) = 31,000
With S as the outer relation, the cost of the Index Nested Loops join will be the cost of reading S plus the cost of 2,000 probes on R:

TotalCost = 200 + 2,000 * (3 + 1) = 8,200

Neither of these solutions is cheaper than the Block Nested Loops join, which required 4,200 I/Os.

(a) With 5 buffer pages, the cost of the Index Nested Loops joins remains the same, but the cost of the Block Nested Loops join will increase. The new cost of the Block Nested Loops join is

TotalCost = N + ⌈N / (B − 2)⌉ * M = 200 + ⌈200/3⌉ * 1,000 = 67,200
And now the cheapest solution is the Index Nested Loops join with S as the outer relation.

(b) If S contains 10 tuples then we'll need to change some of our initial assumptions. Now all of the S tuples fit on a single page, and it will only require a single I/O to access the (single) leaf in the index. Also, each tuple in S will match 1,000 tuples in R.

Block Nested Loops:

TotalCost = N + ⌈N / (B − 2)⌉ * M = 1 + 1 * 1,000 = 1,001

Index Nested Loops with R as the outer relation:

TotalCost = 1,000 + 10,000 * (1 + 1) = 21,000

Index Nested Loops with S as the outer relation:

TotalCost = 1 + 10 * (3 + 100) = 1,031

Block Nested Loops is still the best solution.
Index Nested Loops with R as the outer relation: T otalCost = 1, 000 + 10, 000 ∗ (1 + 1) = 21, 000 Index Nested Loops with S as the outer relation: T otalCost = 1 + 10 ∗ (3 + 100) = 1, 031 Block Nested Loops is still the best solution. 3. SORT-MERGE: With 15 buﬀer pages we can sort R in three passes and S in two passes. The cost of sorting R is 2 ∗ 3 ∗ M = 6, 000, the cost of sorting S is 2 ∗ 2 ∗ N = 800, and the cost of the merging phase is M + N = 1, 200. T otalCost = 6, 000 + 800 + 1, 200 = 8, 000 HASH JOIN: With 15 buﬀer pages the ﬁrst scan of S (the smaller relation) splits it into 14 buckets, each containing about 15 pages. To store one of these buckets (and its hash table) in memory will require f ∗ 15 pages, which is more than we have available. We must apply the Hash Join technique again to all partitions of R and S that were created by the ﬁrst partitioning phase. Then we can ﬁt an entire partition of S in memory. The total cost will be the cost of two partioning phases plus the cost of one matching phase. T otalCost = 2 ∗ (2 ∗ (M + N )) + (M + N ) = 6, 000
4. SORT-MERGE: With 15 buffer pages we can sort R in three passes and S in three passes. The cost of sorting R is 2 * 3 * M = 6,000, the cost of sorting S is 2 * 3 * N = 6,000, and the cost of the merging phase is M + N = 2,000.

TotalCost = 6,000 + 6,000 + 2,000 = 14,000

HASH JOIN: Now both relations are the same size, so we can treat either one as the smaller relation. With 15 buffer pages the first scan of S splits it into 14 buckets, each containing about 72 pages, so again we have to deal with partition overflow. We must apply the Hash Join technique again to all partitions of R and S that were created by the first partitioning phase. Then we can fit an entire partition of S in memory. The total cost will be the cost of two partitioning phases plus the cost of one matching phase:

TotalCost = 2 * (2 * (M + N)) + (M + N) = 10,000

5. SORT-MERGE: With 52 buffer pages we have B > √M, so we can use the "merge on-the-fly" refinement, which costs 3 * (M + N).
TotalCost = 3 * (1,000 + 1,000) = 6,000

HASH JOIN: Now both relations are the same size, so we can treat either one as the smaller relation. With 52 buffer pages the first scan of S splits it into 51 buckets, each containing about 20 pages. This time we do not have to deal with partition overflow. The total cost will be the cost of one partitioning phase plus the cost of one matching phase:

TotalCost = (2 * (M + N)) + (M + N) = 6,000

Exercise 12.6 Answer each of the questions—if some question is inapplicable, explain why—in Exercise 12.1 again, but using the following information about R and S:

Relation R contains 200,000 tuples and has 20 tuples per page.

Relation S contains 4,000,000 tuples and also has 20 tuples per page.

Attribute a of relation R is the primary key for R.

Each tuple of R joins with exactly 20 tuples of S.

1,002 buffer pages are available.

Answer 12.6 Answer omitted.
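The cost formulas used throughout Answers 12.4 and 12.5 can be collected into one sketch. This is an illustrative check using the standard textbook formulas, not part of the original solutions; M, N, and B are the page counts of R and S and the buffer size from Exercise 12.5.

```python
import math

M, N, B = 1000, 200, 52   # pages of R, pages of S, buffer pages

def page_nested_loops(outer, inner):
    # Scan the outer once, plus one full inner scan per outer page.
    return outer + outer * inner

def block_nested_loops(outer, inner, buffers):
    # Outer scanned once in blocks of B-2 pages; inner scanned per block.
    return outer + math.ceil(outer / (buffers - 2)) * inner

def sort_merge(m, n, passes_m, passes_n):
    # 2 I/Os per page per sort pass, plus a single merge pass.
    return 2 * passes_m * m + 2 * passes_n * n + (m + n)

def hash_join(m, n, partition_phases=1):
    # Each partitioning phase reads and writes both relations once,
    # then the matching phase reads both once more.
    return partition_phases * 2 * (m + n) + (m + n)

print(block_nested_loops(N, M, B))          # 200 + 4*1000  = 4200
print(block_nested_loops(N, M, 5))          # 200 + 67*1000 = 67200
print(sort_merge(M, N, 3, 2))               # 6000 + 800 + 1200 = 8000
print(hash_join(M, N, partition_phases=2))  # 4800 + 1200 = 6000
```

Each printed value reproduces a figure quoted in the answers above (the 4,200 BNL baseline, the 67,200 five-buffer case, the 8,000 sort-merge cost, and the 6,000 two-phase hash join).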
Exercise 12.7 We described variations of the join operation called outer joins in Section 5.6.4. One approach to implementing an outer join operation is to ﬁrst evaluate
the corresponding (inner) join and then add additional tuples padded with null values to the result in accordance with the semantics of the given outer join operator. However, this requires us to compare the result of the inner join with the input relations to determine the additional tuples to be added. The cost of this comparison can be avoided by modifying the join algorithm to add these extra tuples to the result while input tuples are processed during the join. Consider the following join algorithms: block nested loops join, index nested loops join, sort-merge join, and hash join. Describe how you would modify each of these algorithms to compute the following operations on the Sailors and Reserves tables discussed in this chapter: 1. Sailors NATURAL LEFT OUTER JOIN Reserves 2. Sailors NATURAL RIGHT OUTER JOIN Reserves 3. Sailors NATURAL FULL OUTER JOIN Reserves Answer 12.7 Answer not available.
13 INTRODUCTION TO QUERY OPTIMIZATION
Exercise 13.1 Brieﬂy answer the following questions. 1. What is the goal of query optimization? Why is it important? 2. Describe the advantages of pipelining. 3. Give an example in which pipelining cannot be used. 4. Describe the iterator interface and explain its advantages. 5. What role do statistics gathered from the database play in query optimization? 6. What information is stored in the system catalogs? 7. What are the beneﬁts of making the system catalogs be relations? 8. What were the important design decisions made in the System R optimizer? Answer 13.1 1. Answer not available.
2. Pipelining allows us to avoid creating and reading temporary relations; the I/O savings can be substantial.

3. Answer not available.

4. The iterator interface for an operator includes the functions open, get next, and close; it hides the details of how the operator is implemented, and allows us to view all operator nodes in a query plan uniformly.

5. The query optimizer uses statistics to improve the chances of selecting an optimum query plan. The statistics are used to calculate reduction factors, which determine the results the optimizer may expect given different indexes and inputs.

6. Answer not available.

7. Answer not available.

8. Answer not available.
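The iterator (open / get next / close) interface described in the answer above can be sketched in a few lines. The operator classes and sample data here are invented for illustration; the point is that a plan is a tree of operators behind one uniform pull-based interface.

```python
class Scan:
    """Leaf operator: iterates over an in-memory table."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None            # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Select:
    """Selection operator: pulls from its child, keeps matching rows."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def get_next(self):
        while (row := self.child.get_next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self):
        self.child.close()

# The caller sees only open/get_next/close, regardless of the plan's shape.
plan = Select(Scan([("CFO", 90), ("VP", 70), ("CFO", 80)]),
              lambda r: r[0] == "CFO")
plan.open()
results = []
while (r := plan.get_next()) is not None:
    results.append(r)
plan.close()
print(results)   # [('CFO', 90), ('CFO', 80)]
```

Because every operator exposes the same three calls, tuples flow through the tree one at a time — exactly the pipelining whose I/O savings item 2 describes.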
14 A TYPICAL QUERY OPTIMIZER
Exercise 14.1 Briefly answer the following questions.

1. In the context of query optimization, what is an SQL query block?

2. Define the term reduction factor.

3. Describe a situation in which projection should precede selection in processing a project-select query, and describe a situation where the opposite processing order is better. (Assume that duplicate elimination for projection is done via sorting.)

4. If there are dense, unclustered (secondary) B+ tree indexes on both R.a and S.b, the join R ⋈_{a=b} S could be processed by doing a sort-merge type of join—without doing any sorting—by using these indexes.

(a) Would this be a good idea if R and S each have only one tuple per page, or would it be better to ignore the indexes and sort R and S? Explain.

(b) What if R and S each have many tuples per page? Again, explain.

5. Why does the System R optimizer consider only left-deep join trees? Give an example of a plan that would not be considered because of this restriction.

6. Explain the role of interesting orders in the System R optimizer.

Answer 14.1

1. An SQL query block is essentially an SQL query without nesting, and serves as a unit of optimization. Queries with nesting can be broken up into a collection of query blocks whose evaluation must be coordinated at runtime.

2. Answer not available.

3. If the selection is to be done on the inner relation of a simple nested loops join, and the projection will reduce the number of pages occupied significantly, then the projection should be done first. The opposite is true in the case of an index-only join, where the projection should be done on the fly after the join.
4. (a) Using the indexes is a good idea when R and S each have only one tuple per page. Each data page is read exactly once, and the cost of scanning the B+ trees is likely to be very small.

(b) Doing an actual data sort on the appropriate keys may be a good idea when R and S have many tuples per page. Given that the indexes are unclustered, without sorting there is potential for many reads of a single page; after sorting, there will be only one read per matching page. The choice may be determined by the number of potential matches and the number of tuples per page.

5. The System R optimizer considers only left-deep joins because they allow fully pipelined plans. As an example, a non-linear (bushy) plan such as (A ⋈ B) ⋈ (C ⋈ D) would not be considered.

6. The System R optimizer implements a multiple-pass algorithm. In each pass, it must consider adding a join to those retained in previous passes. Each level retains the cheapest plan for each interesting order of result tuples. An ordering of tuples is interesting if it is sorted on some combination of fields.
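The left-deep restriction in item 5 can be made concrete with a small sketch (not from the manual): a left-deep tree is fully determined by a permutation of the relations, so n relations yield at most n! candidate orders, and bushy shapes such as (A ⋈ B) ⋈ (C ⋈ D) are simply never generated.

```python
from itertools import permutations

def left_deep_orders(relations):
    """All left-deep join orders over the given relations.

    Each order corresponds to the tree ((R1 join R2) join R3) ...,
    so the search space is exactly the n! permutations; bushy trees
    are excluded by construction.
    """
    return list(permutations(relations))

orders = left_deep_orders(["Emp", "Dept", "Finance"])
print(len(orders))   # 3! = 6 candidate left-deep orders
```

A real System R style optimizer would further prune orders that require cross-products and keep, per interesting order, only the cheapest plan at each level.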
Exercise 14.2 Consider a relation with this schema:

Employees(eid: integer, ename: string, sal: integer, title: string, age: integer)

Suppose that the following indexes, all using Alternative (2) for data entries, exist: a hash index on eid, a B+ tree index on sal, a hash index on age, and a clustered B+ tree index on ⟨age, sal⟩. Each Employees record is 100 bytes long, and you can assume that each index data entry is 20 bytes long. The Employees relation contains 10,000 pages.

1. Consider each of the following selection conditions and, assuming that the reduction factor (RF) for each term that matches an index is 0.1, compute the cost of the most selective access path for retrieving all Employees tuples that satisfy the condition:

(a) sal > 100

(b) age = 25

(c) age > 20

(d) eid = 1,000

(e) sal > 200 ∧ age > 30

(f) sal > 200 ∧ age = 20

(g) sal > 200 ∧ title = 'CFO'

(h) sal > 200 ∧ age > 30 ∧ title = 'CFO'
2. Suppose that for each of the preceding selection conditions, you want to retrieve the average salary of qualifying tuples. For each selection condition, describe the least expensive evaluation method and state its cost. 3. Suppose that for each of the preceding selection conditions, you want to compute the average salary for each age group. For each selection condition, describe the least expensive evaluation method and state its cost. 4. Suppose that for each of the preceding selection conditions, you want to compute the average age for each sal level (i.e., group by sal). For each selection condition, describe the least expensive evaluation method and state its cost. 5. For each of the following selection conditions, describe the best evaluation method: (a) sal > 200 ∨ age = 20 (b) sal > 200 ∨ title = CF O (c) title = CF O ∧ ename = Joe Answer 14.2 Answer omitted.
Exercise 14.3 For each of the following SQL queries, for each relation involved, list the attributes that must be examined in order to compute the answer. All queries refer to the following relations: Emp(eid: integer, did: integer, sal: integer, hobby: char(20)) Dept(did: integer, dname: char(20), ﬂoor: integer, budget: real)
1. SELECT * FROM Emp 2. SELECT * FROM Emp, Dept 3. SELECT * FROM Emp E, Dept D WHERE E.did = D.did 4. SELECT E.eid, D.dname FROM Emp E, Dept D WHERE E.did = D.did 5. SELECT COUNT(*) FROM Emp E, Dept D WHERE E.did = D.did 6. SELECT MAX(E.sal) FROM Emp E, Dept D WHERE E.did = D.did 7. SELECT MAX(E.sal) FROM Emp E, Dept D WHERE E.did = D.did AND D.ﬂoor = 5 8. SELECT E.did, COUNT(*) FROM Emp E, Dept D WHERE E.did = D.did GROUP BY D.did 9. SELECT D.ﬂoor, AVG(D.budget) FROM Dept D GROUP BY D.ﬂoor HAVING COUNT(*) >2 10. SELECT D.ﬂoor, AVG(D.budget) FROM Dept D GROUP BY D.ﬂoor ORDER BY D.ﬂoor
Answer 14.3 1. E.eid, E.did, E.sal, E.hobby
Chapter 14
2. E.eid, E.did, E.sal, E.hobby, D.did, D.dname, D.ﬂoor, D.budget 3. E.eid, E.did, E.sal, E.hobby, D.did, D.dname, D.ﬂoor, D.budget 4. E.eid, D.dname, E.did, D.did 5. E.did, D.did 6. E.sal, E.did, D.did 7. E.sal, E.did, D.did, D.ﬂoor 8. E.did, D.did 9. D.ﬂoor, D.budget 10. D.ﬂoor, D.budget
Exercise 14.4 You are given the following information: Executives has attributes ename, title, dname, and address; all are string ﬁelds of the same length. The ename attribute is a candidate key. The relation contains 10,000 pages. There are 10 buﬀer pages.
1. Consider the following query: SELECT E.title, E.ename FROM Executives E WHERE E.title=‘CFO’ Assume that only 10 percent of Executives tuples meet the selection condition. (a) Suppose that a clustered B+ tree index on title is (the only index) available. What is the cost of the best plan? (In this and subsequent questions, be sure to describe the plan that you have in mind.) (b) Suppose that an unclustered B+ tree index on title is (the only index) available. What is the cost of the best plan? (c) Suppose that a clustered B+ tree index on ename is (the only index) available. What is the cost of the best plan? (d) Suppose that a clustered B+ tree index on address is (the only index) available. What is the cost of the best plan? (e) Suppose that a clustered B+ tree index on ename, title is (the only index) available. What is the cost of the best plan?
2. Suppose that the query is as follows:
SELECT E.ename FROM Executives E WHERE E.title = 'CFO' AND E.dname = 'Toy'

Assume that only 10 percent of Executives tuples meet the condition E.title = 'CFO', only 10 percent meet E.dname = 'Toy', and that only 5 percent meet both conditions.

(a) Suppose that a clustered B+ tree index on title is (the only index) available. What is the cost of the best plan?

(b) Suppose that a clustered B+ tree index on dname is (the only index) available. What is the cost of the best plan?

(c) Suppose that a clustered B+ tree index on ⟨title, dname⟩ is (the only index) available. What is the cost of the best plan?

(d) Suppose that a clustered B+ tree index on ⟨title, ename⟩ is (the only index) available. What is the cost of the best plan?

(e) Suppose that a clustered B+ tree index on ⟨dname, title, ename⟩ is (the only index) available. What is the cost of the best plan?

(f) Suppose that a clustered B+ tree index on ⟨ename, title, dname⟩ is (the only index) available. What is the cost of the best plan?

3. Suppose that the query is as follows:

SELECT E.title, COUNT(*) FROM Executives E GROUP BY E.title

(a) Suppose that a clustered B+ tree index on title is (the only index) available. What is the cost of the best plan?

(b) Suppose that an unclustered B+ tree index on title is (the only index) available. What is the cost of the best plan?

(c) Suppose that a clustered B+ tree index on ename is (the only index) available. What is the cost of the best plan?

(d) Suppose that a clustered B+ tree index on ⟨ename, title⟩ is (the only index) available. What is the cost of the best plan?

(e) Suppose that a clustered B+ tree index on ⟨title, ename⟩ is (the only index) available. What is the cost of the best plan?

4. Suppose that the query is as follows:

SELECT E.title, COUNT(*) FROM Executives E WHERE E.dname > 'W%' GROUP BY E.title

Assume that only 10 percent of Executives tuples meet the selection condition.

(a) Suppose that a clustered B+ tree index on title is (the only index) available.
What is the cost of the best plan? If an additional index (on any search key that you want) is available, would it help to produce a better plan?
(b) Suppose that an unclustered B+ tree index on title is (the only index) available. What is the cost of the best plan? (c) Suppose that a clustered B+ tree index on dname is (the only index) available. What is the cost of the best plan? If an additional index (on any search key that you want) is available, would it help to produce a better plan? (d) Suppose that a clustered B+ tree index on dname, title is (the only index) available. What is the cost of the best plan? (e) Suppose that a clustered B+ tree index on title, dname is (the only index) available. What is the cost of the best plan?
Answer 14.4 Answer omitted.
Exercise 14.5 Consider the query π_{A,B,C,D}(R ⋈_{A=C} S). Suppose that the projection routine is based on sorting and is smart enough to eliminate all but the desired attributes during the initial pass of the sort, and also to toss out duplicate tuples on-the-fly while sorting, thus eliminating two potential extra passes. Finally, assume that you know the following:

R is 10 pages long, and R tuples are 300 bytes long.

S is 100 pages long, and S tuples are 500 bytes long.

C is a key for S, and A is a key for R.

The page size is 1,024 bytes.

Each S tuple joins with exactly one R tuple.

The combined size of attributes A, B, C, and D is 450 bytes.

A and B are in R and have a combined size of 200 bytes; C and D are in S.

1. What is the cost of writing out the final result? (As usual, you should ignore this cost in answering subsequent questions.)

2. Suppose that three buffer pages are available, and the only join method that is implemented is simple (page-oriented) nested loops.

(a) Compute the cost of doing the projection followed by the join.

(b) Compute the cost of doing the join followed by the projection.

(c) Compute the cost of doing the join first and then the projection on-the-fly.

(d) Would your answers change if 11 buffer pages were available?

3. Suppose that there are three buffer pages available, and the only join method that is implemented is block nested loops.

(a) Compute the cost of doing the projection followed by the join.

(b) Compute the cost of doing the join followed by the projection.
(c) Compute the cost of doing the join first and then the projection on-the-fly.

(d) Would your answers change if 11 buffer pages were available?

Answer 14.5

1. Relation S has 200 tuples, and since each S tuple joins with exactly one R tuple, the join has 200 tuples. The combined size of each result tuple is 450 bytes, so only 2 tuples fit per page, and the join result occupies 100 pages; writing it out therefore costs 100 I/Os.

2. (a) Cost of projection followed by join: The projection is sort-based, so we must sort relation S, which contains attributes C and D. Relation S has 100 pages and we have 3 buffer pages, so the number of sort passes is 1 + ⌈log2⌈100/3⌉⌉ = 7 and the sort cost is 2 * 100 * 7 = 1,400 I/Os. Assume that 1/10 of the tuples are removed as duplicates, leaving 180 tuples of S, each of size 250 bytes (the combined size of attributes C and D). Four such tuples fit on a page, so the resulting inner relation is 45 pages. The cost of the simple nested loops join is then 10 + 10 * 45 = 460 I/Os, for a total cost of 1,860.

(b) Cost of join followed by projection: The SNL join costs 10 + 10 * 100 = 1,010 I/Os and produces 200 tuples, each of size 550 bytes. Only one result tuple fits on a page, so the result is 200 pages. The projection is a sort using 3 buffer pages; in the first pass, unwanted attributes are eliminated on-the-fly to produce tuples of size 450 bytes, i.e., 2 tuples per page. Thus 200 pages are scanned and 100 pages written in the first pass, in 33 runs of 3 pages each and 1 run of a single page. These 34 runs are merged pairwise in 6 additional passes, for a total projection cost of 200 + 100 + 2 * 6 * 100 = 1,500 I/Os. This includes the cost of writing out the 100-page result; removing this cost and adding the cost of the join step, we obtain a total cost of 1,010 + 1,500 − 100 = 2,410 I/Os.

(c) Cost of join with projection on-the-fly: The projection cost is 0, so the only cost is the join, which we know from (b) is 1,010 I/Os.
(d) If we had 11 buffer pages, the merge passes of the projection sorts would be 10-way rather than 2-way (log base 10 instead of log base 2), reducing the number of passes; the page-oriented nested loops join cost would be unchanged.

3. (a) Using Block Nested Loops with 3 buffer pages, the cost of projection then join is the same as in 2(a), since BNL with 3 pages degenerates to page-oriented SNL and the projection sorts are unchanged.

(b) Cost of join then projection: same as 2(b).

(c) Cost of join with projection on-the-fly: same as 2(c).

(d) Now that we have 11 buffer pages, both the BNL join and the projections are affected.
Part (a): sort cost log10(100) * 200 = 400, plus BNL join cost 10 + ⌈10/9⌉ * 45 = 100, for a total of 500 I/Os.

Part (b): BNL join cost 10 + ⌈10/9⌉ * 100 = 210, plus projection cost ⌈log10(200)⌉ * 400 = 1,200, for a total of 210 + 1,200 = 1,410 I/Os.

Part (c): 10 + ⌈10/9⌉ * 100 = 210 I/Os.
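The pass counts behind these figures follow the standard external-sort formula (one run-formation pass over ⌈pages/B⌉ runs, then (B−1)-way merges). The helper below is an illustrative check of the arithmetic in Answer 14.5, not part of the original solution.

```python
import math

def sort_passes(pages, buffers):
    """Total external-sort passes: 1 run-formation pass producing
    ceil(pages/buffers) runs, then (buffers-1)-way merge passes."""
    runs = math.ceil(pages / buffers)
    merges = 0 if runs <= 1 else math.ceil(math.log(runs, buffers - 1))
    return 1 + merges

def sort_cost(pages, buffers):
    # Each pass reads and writes every page once.
    return 2 * pages * sort_passes(pages, buffers)

def snl_cost(outer, inner):
    # Page-oriented simple nested loops join.
    return outer + outer * inner

# Part 2(a): sort the 100-page projection of S with 3 buffers, then SNL:
print(sort_passes(100, 3))   # 7 passes
print(sort_cost(100, 3))     # 2 * 100 * 7 = 1400
print(snl_cost(10, 45))      # 10 + 10*45 = 460
```

The 7 passes, 1,400-I/O sort, and 460-I/O join reproduce the numbers in part 2(a) above.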
Exercise 14.6 Brieﬂy answer the following questions. 1. Explain the role of relational algebra equivalences in the System R optimizer. 2. Consider a relational algebra expression of the form σc (πl (R × S)). Suppose that the equivalent expression with selections and projections pushed as much as possible, taking into account only relational algebra equivalences, is in one of the following forms. In each case give an illustrative example of the selection conditions and the projection lists (c, l, c1, l1, etc.). (a) Equivalent maximally pushed form: πl1 (σc1 (R) × S). (b) Equivalent maximally pushed form: πl1 (σc1 (R) × σc2 (S)). (c) Equivalent maximally pushed form: σc (πl1 (πl2 (R) × S)). (d) Equivalent maximally pushed form: σc1 (πl1 (σc2 (πl2 (R)) × S)). (e) Equivalent maximally pushed form: σc1 (πl1 (πl2 (σc2 (R)) × S)). (f) Equivalent maximally pushed form: πl (σc1 (πl1 (πl2 (σc2 (R)) × S))). Answer 14.6 Answer omitted.
Exercise 14.7 Consider the following relational schema and SQL query. The schema captures information about employees, departments, and company ﬁnances (organized on a per department basis). Emp(eid: integer, did: integer, sal: integer, hobby: char(20)) Dept(did: integer, dname: char(20), ﬂoor: integer, phone: char(10)) Finance(did: integer, budget: real, sales: real, expenses: real) Consider the following query: SELECT D.dname, F.budget FROM Emp E, Dept D, Finance F WHERE E.did=D.did AND D.did=F.did AND D.ﬂoor=1 AND E.sal ≥ 59000 AND E.hobby = ‘yodeling’ 1. Identify a relational algebra tree (or a relational algebra expression if you prefer) that reﬂects the order of operations that a decent query optimizer would choose.
2. List the join orders (i.e., orders in which pairs of relations can be joined together to compute the query result) that a relational query optimizer will consider. (Assume that the optimizer follows the heuristic of never considering plans that require the computation of cross-products.) Briefly explain how you arrived at your list.

3. Suppose that the following additional information is available: Unclustered B+ tree indexes exist on Emp.did, Emp.sal, Dept.floor, Dept.did, and Finance.did. The system's statistics indicate that employee salaries range from 10,000 to 60,000, employees enjoy 200 different hobbies, and the company owns two floors in the building. There are a total of 50,000 employees and 5,000 departments (each with corresponding financial information) in the database. The DBMS used by the company has just one join method available, namely, index nested loops.

(a) For each of the query's base relations (Emp, Dept and Finance) estimate the number of tuples that would be initially selected from that relation if all of the non-join predicates on that relation were applied to it before any join processing begins.

(b) Given your answer to the preceding question, which of the join orders that are considered by the optimizer has the least estimated cost?

Answer 14.7 The answers are as follows.

1. π_{D.dname, F.budget}((π_{E.did}(σ_{E.sal ≥ 59000 ∧ E.hobby = 'yodeling'}(E)) ⋈ π_{D.did, D.dname}(σ_{D.floor = 1}(D))) ⋈ π_{F.budget, F.did}(F))
2. There are 2 join orders considered, assuming that the optimizer considers only left-deep joins and ignores cross-products: (D, E, F) and (D, F, E).

3. (a) Emp: card = 50,000; with E.sal >= 59,000 and E.hobby = 'yodeling', the resulting card = 50,000 * 1/50 * 1/200 = 5.

Dept: card = 5,000; with D.floor = 1, the resulting card = 5,000 * 1/2 = 2,500.

Finance: card = 5,000; there are no non-join predicates, so the resulting card = 5,000.

(b) Consider the following join methods on the following left-deep tree: ((E ⋈ D) ⋈ F). The tuples from E will be pipelined, so no temporary relations are created. First, retrieve the tuples from E with sal >= 59,000 using the B+ tree index on salary; we estimate 1,000 such tuples will be found, with a cost of 1 tree traversal + the cost of retrieving the 1,000 tuples (since the index is
unclustered) = 3 + 1,000 = 1,003. Note that we ignore the cost of scanning the leaves. Of these 1,000 retrieved tuples, select on the fly only those that have hobby = 'yodeling'; we estimate there will be 5 such tuples. Pipeline these 5 tuples one at a time to D; using the B+ tree index on D.did and the fact that D.did is a key, we can find the matching tuples for the join by searching the B+ tree and retrieving at most 1 matching tuple, for a total cost of 5 * (3 + 1) = 20. The resulting cardinality of this join is at most 5. Pipeline the estimated 3 of these 5 tuples that have D.floor = 1 up to F, and use the B+ tree index on F.did and the fact that F.did is a key to retrieve at most 1 F tuple for each of the 3 pipelined tuples. This costs at most 3 * (3 + 1) = 12. Ignoring the cost of writing out the final result, we get a total cost of 1,003 + 20 + 12 = 1,035.
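The cardinality estimates in part 3(a) follow the usual reduction-factor arithmetic: multiply a relation's cardinality by the reduction factor of each applicable predicate. The helper below is an illustrative sketch (the function name is invented), reproducing the estimates above.

```python
from math import prod

def estimated_cardinality(card, reduction_factors):
    """Textbook cardinality estimate: base cardinality times the
    product of the reduction factors of all applicable predicates."""
    return card * prod(reduction_factors)

# Emp: sal >= 59,000 over a uniform 10,000..60,000 range -> RF 1/50;
# hobby = 'yodeling' with 200 distinct hobbies -> RF 1/200.
print(round(estimated_cardinality(50000, [1/50, 1/200])))   # 5
# Dept: floor = 1 with two floors -> RF 1/2.
print(round(estimated_cardinality(5000, [1/2])))            # 2500
```

These are the 5-tuple Emp estimate and 2,500-tuple Dept estimate used to cost the ((E ⋈ D) ⋈ F) plan in part 3(b).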
Exercise 14.8 Consider the following relational schema and SQL query: Suppliers(sid: integer, sname: char(20), city: char(20)) Supply(sid: integer, pid: integer) Parts(pid: integer, pname: char(20), price: real) SELECT S.sname, P.pname FROM Suppliers S, Parts P, Supply Y WHERE S.sid = Y.sid AND Y.pid = P.pid AND S.city = ‘Madison’ AND P.price ≤ 1,000 1. What information about these relations will the query optimizer need to select a good query execution plan for the given query? 2. How many diﬀerent join orders, assuming that cross-products are disallowed, will a System R style query optimizer consider when deciding how to process the given query? List each of these join orders. 3. What indexes might be of help in processing this query? Explain brieﬂy. 4. How does adding DISTINCT to the SELECT clause aﬀect the plans produced? 5. How does adding ORDER BY sname to the query aﬀect the plans produced? 6. How does adding GROUP BY sname to the query aﬀect the plans produced? Answer 14.8 Answer omitted.
Exercise 14.9 Consider the following scenario: Emp(eid: integer, sal: integer, age: real, did: integer) Dept(did: integer, projid: integer, budget: real, status: char(10)) Proj(projid: integer, code: integer, report: varchar)
Assume that each Emp record is 20 bytes long, each Dept record is 40 bytes long, and each Proj record is 2,000 bytes long on average. There are 20,000 tuples in Emp, 5,000 tuples in Dept (note that did is not a key), and 1,000 tuples in Proj. Each department, identified by did, has 10 projects on average. The file system supports 4,000-byte pages, and 12 buffer pages are available. The following questions are all based on this information. You can assume uniform distribution of values. State any additional assumptions. The cost metric to use is the number of page I/Os. Ignore the cost of writing out the final result.

1. Consider the following two queries: "Find all employees with age = 30" and "Find all projects with code = 20." Assume that the number of qualifying tuples is the same in each case. If you are building indexes on the selected attributes to speed up these queries, for which query is a clustered index (in comparison to an unclustered index) more important?

2. Consider the following query: "Find all employees with age > 30." Assume that there is an unclustered index on age. Let the number of qualifying tuples be N. For what values of N is a sequential scan cheaper than using the index?

3. Consider the following query:

SELECT * FROM Emp E, Dept D WHERE E.did=D.did

(a) Suppose that there is a clustered hash index on did on Emp. List all the plans that are considered and identify the plan with the least estimated cost.

(b) Assume that both relations are sorted on the join column. List all the plans that are considered and show the plan with the least estimated cost.

(c) Suppose that there is a clustered B+ tree index on did on Emp and that Dept is sorted on did. List all the plans that are considered and identify the plan with the least estimated cost.

4. Consider the following query:

SELECT D.did, COUNT(*) FROM Dept D, Proj P WHERE D.projid=P.projid GROUP BY D.did
(a) Suppose that no indexes are available. Show the plan with the least estimated cost.
(b) If there is a hash index on P.projid, what is the plan with least estimated cost?
(c) If there is a hash index on D.projid, what is the plan with least estimated cost?
(d) If there is a hash index on D.projid and P.projid, what is the plan with least estimated cost?
(e) Suppose that there is a clustered B+ tree index on D.did and a hash index on P.projid. Show the plan with the least estimated cost.
(f) Suppose that there is a clustered B+ tree index on D.did, a hash index on D.projid, and a hash index on P.projid. Show the plan with the least estimated cost.
(g) Suppose that there is a clustered B+ tree index on D.did, D.projid and a hash index on P.projid. Show the plan with the least estimated cost.
(h) Suppose that there is a clustered B+ tree index on D.projid, D.did and a hash index on P.projid. Show the plan with the least estimated cost.
5. Consider the following query:

SELECT   D.did, COUNT(*)
FROM     Dept D, Proj P
WHERE    D.projid=P.projid AND D.budget>99000
GROUP BY D.did
Assume that department budgets are uniformly distributed in the range 0 to 100,000.

(a) Show the plan with least estimated cost if no indexes are available.
(b) If there is a hash index on P.projid, show the plan with least estimated cost.
(c) If there is a hash index on D.budget, show the plan with least estimated cost.
(d) If there is a hash index on D.projid and D.budget, show the plan with least estimated cost.
(e) Suppose that there is a clustered B+ tree index on D.did, D.budget and a hash index on P.projid. Show the plan with the least estimated cost.
(f) Suppose that there is a clustered B+ tree index on D.did, a hash index on D.budget, and a hash index on P.projid. Show the plan with the least estimated cost.
(g) Suppose that there is a clustered B+ tree index on D.did, D.budget, D.projid and a hash index on P.projid. Show the plan with the least estimated cost.
A Typical Query Optimizer
(h) Suppose that there is a clustered B+ tree index on D.did, D.projid, D.budget and a hash index on P.projid. Show the plan with the least estimated cost.

6. Consider the following query:

SELECT E.eid, D.did, P.projid
FROM   Emp E, Dept D, Proj P
WHERE  E.sal=50,000 AND D.budget>20,000 AND E.did=D.did AND D.projid=P.projid

Assume that employee salaries are uniformly distributed in the range 10,009 to 110,008 and that department budgets are uniformly distributed in the range 10,000 to 30,000. There is a clustered index on sal for Emp, a clustered index on did for Dept, and a clustered index on projid for Proj.

(a) List all the one-relation, two-relation, and three-relation subplans considered in optimizing this query.
(b) Show the plan with the least estimated cost for this query.
(c) If the index on Proj were unclustered, would the cost of the preceding plan change substantially? What if the index on Emp or on Dept were unclustered?

Answer 14.9 The reader should calculate actual costs of all alternative plans; in the answers below, we just outline the best plans without detailed cost calculations to prove that these are indeed the best plans.

1. The question specifies that the number, rather than the fraction, of qualifying tuples is identical for the two queries. Since Emp tuples are small, many will fit on a single page; conversely, few (just 2) of the large Proj tuples will fit on a page. Since we wish to minimize the number of page I/Os, it is an advantage if the Emp tuples are clustered with respect to the age index (all matching tuples can then be retrieved in a few page I/Os). Clustering is not as important for the Proj tuples, since almost every matching tuple requires a page I/O even with clustering.

2. The Emp relation occupies 100 pages. For an unclustered index, retrieving N tuples requires N page I/Os. If more than 100 tuples match, the cost of fetching Emp tuples by following pointers in the index data entries exceeds the cost of a sequential scan.
(Using the index also involves about 2 I/Os to get to the right leaf page, plus the cost of fetching the leaf pages that contain qualifying data entries; this makes the scan better than the index even with slightly fewer than 100 matches.)

3. (a) One plan is to use (simple or blocked) NL join with E as the outer. Another plan is SM or hash join. A third plan is to use D as the outer and to use INL; given the clustered hash index on the did field of Emp, this plan is likely the cheapest.
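The break-even point in part 2 can be double-checked with a small cost model; this is only a sketch, with the page counts taken from the exercise and the two extra I/Os for the index's internal pages following the rough figure used above.

```python
# Back-of-the-envelope check of part 2: sequential scan vs. unclustered index.
PAGE_SIZE = 4000
EMP_TUPLES = 20000
EMP_TUPLE_BYTES = 20

emp_pages = EMP_TUPLES * EMP_TUPLE_BYTES // PAGE_SIZE  # 100 pages

def scan_cost():
    # A sequential scan reads every page of Emp once.
    return emp_pages

def unclustered_index_cost(n_matches, internal_ios=2):
    # One page I/O per qualifying tuple, plus the descent through the index.
    return internal_ios + n_matches

# Smallest N for which the sequential scan is strictly cheaper.
break_even = next(n for n in range(EMP_TUPLES)
                  if scan_cost() < unclustered_index_cost(n))
print(emp_pages, break_even)  # 100 99
```

The crossover sits at roughly N = 100 matches, consistent with the answer above.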
(b) The same plans are considered as before, but now, SM join is the best strategy because both relations are sorted on the join column (and all tuples of Emp are likely to join with some tuple of Dept, and must therefore be fetched at least once, even if INL is used). (c) The same plans are considered as before. As in the previous case, SM join is the best: the clustered B+ tree index on Emp can be used to eﬃciently retrieve Emp tuples in sorted order.
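Parts (b) and (c) rely on the merge phase of sort-merge join being a single cheap pass once both inputs are in did order. A minimal sketch of that merge step, on tiny made-up inputs:

```python
# Merge step of sort-merge join for an equijoin on the first field (did).
# Handles many-to-many matches by pairing runs of equal keys.
def merge_join(left, right, key=lambda t: t[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right tuples with this key, then pair it
            # with every left tuple in the matching run.
            j2 = j
            while j2 < len(right) and key(right[j2]) == kl:
                j2 += 1
            while i < len(left) and key(left[i]) == kl:
                out.extend((left[i], r) for r in right[j:j2])
                i += 1
            j = j2
    return out

emp = [(1, 'e1'), (1, 'e2'), (2, 'e3')]    # (did, eid), sorted on did
dept = [(1, 'd1'), (2, 'd2'), (3, 'd3')]   # (did, dname), sorted on did
print(merge_join(emp, dept))
```

Because both inputs are already sorted, the merge reads each input once; this is why SM join wins in parts (b) and (c).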
4. (a) BNL with Proj as the outer, followed by sorting on did to implement the aggregation. All attributes except did can be eliminated during the join, but duplicates should not be eliminated!
(b) Sort Dept on did first (all attributes other than did and projid can be projected out), then scan while probing Proj and counting tuples in each did group on-the-fly.
(c) INL with Dept as inner, followed by sorting on did to implement the aggregation. Again, all attributes except did can be eliminated during the join, but duplicates should not be eliminated!
(d) As in the previous case, INL with Dept as inner, followed by sorting on did to implement the aggregation. Again, all attributes except did can be eliminated during the join, but duplicates should not be eliminated!
(e) Scan Dept in did order using the clustered B+ tree index while probing Proj and counting tuples in each did group on-the-fly.
(f) Same as above.
(g) Scan the clustered B+ tree index using an index-only scan while probing Proj and counting tuples in each did group on-the-fly.
(h) Sort the data entries in the clustered B+ tree index on Dept, then scan while probing Proj and counting tuples in each did group on-the-fly.

5. (a) BNL with Proj as the outer with the selection applied on-the-fly, followed by sorting on did to implement the aggregation. All attributes except did can be eliminated during the join, but duplicates should not be eliminated!
(b) Sort Dept on did first (while applying the selection and projecting out all attributes other than did and projid in the initial scan), then scan while probing Proj and counting tuples in each did group on-the-fly.
(c) Select Dept tuples using the index on budget, join using INL with Proj as inner, projecting out all attributes except did. Then sort to implement the aggregation.
(d) Same as the case with no index; this index does not help.
(e) Retrieve Dept tuples that satisfy the condition on budget in did order by using the clustered B+ tree index, while probing Proj and counting tuples in each did group on-the-fly.
(f) Since the condition on budget is very selective, even though the index on budget is unclustered we retrieve Dept tuples using this index, project out the did and projid fields, and sort them by did. Then we scan while probing Proj and counting tuples in each did group on-the-fly.
(g) Use an index-only scan on the B+ tree and apply the condition on budget, while probing Proj and counting tuples in each did group on-the-fly. Notice that this plan is applicable even if the B+ tree index is not clustered. (Within each did group, we can optimize the search for data entries in the index that satisfy the budget condition, but this is a minor gain.)
(h) Use an index-only scan on the B+ tree and apply the condition on budget, while probing Proj and counting tuples in each did group on-the-fly.

6. (a) 1-relation subplans: Clustered index on E.sal; Scan Dept; and Scan Proj.
2-relation subplans: (i) Clustered index on E.sal, probe Dept using the index on did, apply the predicate on D.budget, and join. (ii) Scan Dept, apply the predicate on D.budget, and probe Proj. (iii) Scan Proj, probe Dept, apply the predicate on D.budget, and join.
3-relation subplans: Join Emp and Dept and probe Proj; Join Dept and Proj and probe Emp.
(b) The least-cost plan is to use the index on E.sal to eliminate most tuples, probe Dept using the index on D.did, apply the predicate on D.budget, and probe and join on Proj.projid.
(c) Unclustering the index on Proj would increase the number of I/Os, but not substantially, since the total number of matching Proj tuples to be retrieved is small.
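The "apply the selection, sort Dept, then probe Proj while counting each did group on-the-fly" pipeline of parts 4(b) and 5(b) can be sketched as below. Under the uniform-distribution assumption, budget > 99000 keeps only about 1% of Dept, which is why applying the selection first pays off. The tiny relations here are made up purely for illustration.

```python
# Select on budget, project to (did, projid), sort on did, then probe Proj
# and emit each did group's count without materializing the join result.
from itertools import groupby

dept = [(2, 11, 99500.0), (1, 10, 50000.0), (1, 12, 99900.0)]  # (did, projid, budget)
proj_ids = {10, 11, 12}            # projids that have a matching Proj tuple

selected = sorted((d, p) for d, p, b in dept if b > 99000)      # select + project + sort
counts = [(did, sum(1 for _, p in grp if p in proj_ids))
          for did, grp in groupby(selected, key=lambda t: t[0])]
print(counts)  # [(1, 1), (2, 1)]
```

Because the input to `groupby` is sorted on did, each group's COUNT(*) is complete the moment the scan moves past it, matching the on-the-fly aggregation in the plans above.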
15
SCHEMA REFINEMENT AND NORMAL FORMS
Exercise 15.1 Briefly answer the following questions.

1. Define the term functional dependency.
2. Give a set of FDs for the relation schema R(A,B,C,D) with primary key AB under which R is in 1NF but not in 2NF.
3. Give a set of FDs for the relation schema R(A,B,C,D) with primary key AB under which R is in 2NF but not in 3NF.
4. Consider the relation schema R(A,B,C), which has the FD B → C. If A is a candidate key for R, is it possible for R to be in BCNF? If so, under what conditions? If not, explain why not.
5. Suppose that we have a relation schema R(A,B,C) representing a relationship between two entity sets with keys A and B, respectively, and suppose that R has (among others) the FDs A → B and B → A. Explain what such a pair of dependencies means (i.e., what they imply about the relationship that the relation models).

Answer 15.1

1. Let R be a relational schema and let X and Y be two subsets of the set of all attributes of R. We say Y is functionally dependent on X, written X → Y, if the Y-values are determined by the X-values. More precisely, for any two tuples r1 and r2 in (any instance of) R,

πX(r1) = πX(r2) ⇒ πY(r1) = πY(r2)
2. Consider the FD A → C. More generally, any (nontrivial) FD X → α, with α not equal to A or B, and X = A or X = B, will violate 2NF.
3. Consider the FD D → C. More generally, any (nontrivial) FD X → α, with α not equal to A or B, X not a proper subset of {A, B}, and X not containing AB, will violate 3NF but not 2NF.

4. The only way R could be in BCNF is if B includes a key, i.e., B is a key for R.

5. It means that the relationship is one to one. That is, each A entity corresponds to at most one B entity and vice-versa. (In addition, we have the dependency AB → C, from the semantics of a relationship set.)

Exercise 15.2 Consider a relation R with five attributes ABCDE. You are given the following dependencies: A → B, BC → E, and ED → A.

1. List all keys for R.
2. Is R in 3NF?
3. Is R in BCNF?

Answer 15.2 Answer omitted.
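The definition in Answer 15.1 can be checked mechanically on a given instance: X → Y holds iff no two tuples agree on X but disagree on Y. A small sketch, with relations represented as lists of dicts purely for illustration:

```python
# Check whether the FD X -> Y holds on an instance: scan once, remembering
# the Y-value seen for each X-value; a conflicting Y-value is a violation.
def fd_holds(instance, X, Y):
    seen = {}
    for t in instance:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen[x_val] = y_val
    return True

r = [{'A': 1, 'B': 2, 'C': 3},
     {'A': 4, 'B': 2, 'C': 3},
     {'A': 5, 'B': 3, 'C': 3}]
print(fd_holds(r, ['B'], ['C']), fd_holds(r, ['B'], ['A']))  # True False
```

Note that an instance can only ever refute an FD; it can never prove that the FD holds on the schema.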
Exercise 15.3 Consider the following collection of relations and dependencies. Assume that each relation is obtained through decomposition from a relation with attributes ABCDEFGHI and that all the known dependencies over relation ABCDEFGHI are listed for each question. (The questions are independent of each other, obviously, since the given dependencies over ABCDEFGHI are different.) For each (sub)relation: (a) State the strongest normal form that the relation is in. (b) If it is not in BCNF, decompose it into a collection of BCNF relations.

1. R1(A,C,B,D,E), A → B, C → D
2. R2(A,B,F), AC → E, B → F
3. R3(A,D), D → G, G → H
4. R4(D,C,H,G), A → I, I → A
5. R5(A,I,C,E)

Answer 15.3

1. Not in 3NF. BCNF decomposition: AB, CD, ACE.
2. Not in 3NF. BCNF decomposition: AB, BF.
3. BCNF.
4. BCNF.
5. BCNF.
Exercise 15.4 Suppose that we have the following three tuples in a legal instance of a relation schema S with three attributes ABC (listed in order): (1,2,3), (4,2,3), and (5,3,3).

1. Which of the following dependencies can you infer does not hold over schema S? (a) A → B (b) BC → A (c) B → C
2. Can you identify any dependencies that hold over S?

Answer 15.4 Answer omitted.
Exercise 15.5 Suppose you are given a relation R with four attributes, ABCD. For each of the following sets of FDs, assuming those are the only dependencies that hold for R, do the following: (a) Identify the candidate key(s) for R. (b) Identify the best normal form that R satisfies (1NF, 2NF, 3NF, or BCNF). (c) If R is not in BCNF, decompose it into a set of BCNF relations that preserve the dependencies.

1. C → D, C → A, B → C
2. B → C, D → A
3. ABC → D, D → A
4. A → B, BC → D, A → C
5. AB → C, AB → D, C → A, D → B

Answer 15.5

1. (a) Candidate keys: B
(b) R is in 2NF but not 3NF.
(c) C → D and C → A both cause violations of BCNF. One way to obtain a lossless-join, dependency-preserving decomposition is to decompose R into AC, BC, and CD.
2. (a) Candidate keys: BD
(b) R is in 1NF but not 2NF.
(c) Both B → C and D → A cause BCNF violations. The decomposition AD, BC, BD (obtained by first decomposing to AD, BCD) is BCNF, lossless, and dependency-preserving.
3. (a) Candidate keys: ABC, BCD
(b) R is in 3NF but not BCNF.
(c) ABCD is not in BCNF, since D → A holds and D is not a key. However, if we split up R as AD, BCD, we cannot preserve the dependency ABC → D. So there is no dependency-preserving BCNF decomposition.
4. (a) Candidate keys: A
(b) R is in 2NF but not 3NF (because of the FD BC → D).
(c) BC → D violates BCNF since BC does not contain a key. So we split up R as BCD, ABC.
5. (a) Candidate keys: AB, BC, CD, AD
(b) R is in 3NF but not BCNF (because of the FD C → A).
(c) C → A and D → B both cause violations. Decomposing into AC, BCD does not suffice, since BCD is still not in BCNF (D → B), so we decompose further into AC, BD, CD. This is a lossless-join BCNF decomposition, but it does not preserve AB → C or AB → D; adding ABC and ABD to restore these dependencies would reintroduce BCNF violations (C → A holds in ABC, and D → B holds in ABD). As in part 3, there is no dependency-preserving BCNF decomposition.

Exercise 15.6 Consider the attribute set R = ABCDEGH and the FD set F = {AB → C, AC → B, AD → E, B → D, BC → A, E → G}.

1. For each of the following attribute sets, do the following: (i) Compute the set of dependencies that hold over the set and write down a minimal cover. (ii) Name the strongest normal form that is not violated by the relation containing these attributes. (iii) Decompose it into a collection of BCNF relations if it is not in BCNF.
(a) ABC
(b) ABCD
(c) ABCEG
(d) DCEGH
(e) ACEH

2. Which of the following decompositions of R = ABCDEG, with the same set of dependencies F, is (a) dependency-preserving? (b) lossless-join?
(a) {AB, BC, ABDE, EG}
(b) {ABC, ACDE, ADG}

Answer 15.6 Answer omitted.
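The candidate-key identification used throughout these answers can be brute-forced for small schemas by computing attribute closures and keeping only minimal key sets. A sketch (the function names are our own):

```python
# Enumerate attribute subsets by size; a subset is a candidate key if its
# closure covers the schema and no smaller key is contained in it.
from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a subset is already a key, so combo is not minimal
            if closure(set(combo), fds) == schema:
                keys.append(combo)
    return keys

# Exercise 15.5(1): C -> D, C -> A, B -> C over ABCD; the only key is B.
fds = [({'C'}, {'D'}), ({'C'}, {'A'}), ({'B'}, {'C'})]
print(candidate_keys({'A', 'B', 'C', 'D'}, fds))  # [('B',)]
```

The enumeration is exponential in the number of attributes, which is fine for exercise-sized schemas but not in general.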
Exercise 15.7 Let R be decomposed into R1, R2, ..., Rn. Let F be a set of FDs on R.

1. Define what it means for F to be preserved in the set of decomposed relations.
2. Describe a polynomial-time algorithm to test dependency-preservation.
3. Projecting the FDs stated over a set of attributes X onto a subset of attributes Y requires that we consider the closure of the FDs. Give an example where considering the closure is important in testing dependency-preservation; that is, considering just the given FDs gives incorrect results.

Answer 15.7

1. Let Fi denote the projection of F on Ri. F is preserved if the closure of the union of the Fi's equals the closure of F. (Note that the closure of F is always a superset of this closure.)

2. We describe an algorithm for testing dependency preservation that is polynomial in the cardinality of F. For each dependency X → Y ∈ F, check whether it is implied by the union of the projected FDs, as follows: start with the set S of attributes in X. For each relation Ri, compute the closure of S ∩ Ri relative to F and project this closure onto the attributes of Ri; if this yields additional attributes, add them to S. Repeat until there is no change to S. The dependency X → Y is preserved if and only if Y ⊆ S at the end.

3. There is an example in the text.

Exercise 15.8 Consider a relation R that has three attributes ABC. It is decomposed into relations R1 with attributes AB and R2 with attributes BC.

1. State the definition of a lossless-join decomposition with respect to this example. Answer this question concisely by writing a relational algebra equation involving R, R1, and R2.
2. Suppose that B →→ C. Is the decomposition of R into R1 and R2 lossless-join? Reconcile your answer with the observation that neither of the FDs R1 ∩ R2 → R1 nor R1 ∩ R2 → R2 hold, in light of the simple test offering a necessary and sufficient condition for lossless-join decomposition into two relations in Section 15.6.1.
3. If you are given the following instances of R1 and R2, what can you say about the instance of R from which these were obtained? Answer this question by listing tuples that are definitely in R and listing tuples that are possibly in R.
Instance of R1 = {(5,1), (6,1)}
Instance of R2 = {(1,8), (1,9)}

Can you say that attribute B definitely is or is not a key for R?

Answer 15.8 Answer omitted.

Exercise 15.9 Suppose you are given a relation R(A,B,C,D). For each of the following sets of FDs, assuming they are the only dependencies that hold for R, do the following: (a) Identify the candidate key(s) for R. (b) State whether or not the proposed decomposition of R into smaller relations is a good decomposition, and briefly explain why or why not.
1. B → C, D → A; decompose into BC and AD.
2. AB → C, C → A, C → D; decompose into ACD and BC.
3. A → BC, C → AD; decompose into ABC and AD.
4. A → B, B → C, C → D; decompose into AB and ACD.
5. A → B, B → C, C → D; decompose into AB, AD and CD.

Answer 15.9
1. Candidate key(s): BD. The decomposition into BC and AD is unsatisfactory because it is lossy (the join of BC and AD is their cartesian product, which can be much bigger than R).
2. Candidate key(s): AB, BC. The decomposition into ACD and BC is lossless, since ACD ∩ BC (which is C) → ACD. The projection of the FDs on ACD includes C → D and C → A (so C is a key for ACD), and the projection on BC produces no nontrivial dependencies. In particular, this is a BCNF decomposition (check that R itself is not in BCNF!). However, it is not dependency-preserving, since the dependency AB → C is not preserved. To enforce this dependency without a join we would need to add ABC, which introduces redundancy. So implicitly there is some redundancy across relations (although none within ACD or BC).
3. Candidate key(s): A, C. Since A and C are both candidate keys for R, it is already in BCNF, so from a normalization standpoint it makes no sense to decompose R further.
4. Candidate key(s): A. The projection of the dependencies on AB is A → B, and on ACD it is A → C and C → D (the rest follow from these). The scheme ACD is not even in 3NF, since C is not a superkey and D is not part of a key. This is a lossless-join decomposition (since A is a key), but it is not dependency-preserving, since B → C is not preserved.
5. Candidate key(s): A (just as before). This is a lossless-join BCNF decomposition (easy to check!). It is, however, not dependency-preserving (consider B → C), so it is not free of (implied) redundancy. It is also not the best such decomposition: AB, BC, CD is better.
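The lossless-join arguments above all come down to the binary test from Section 15.6.1 (the subject of Exercise 15.18): a decomposition into R1 and R2 is lossless iff the common attributes functionally determine R1 or R2. A sketch:

```python
# Binary lossless-join test: compute the closure of the shared attributes
# and see whether it covers either component.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    c = closure(r1 & r2, fds)
    return r1 <= c or r2 <= c

# Exercise 15.9(1): BC and AD share no attributes, so the split is lossy.
fds1 = [({'B'}, {'C'}), ({'D'}, {'A'})]
print(lossless_binary({'B', 'C'}, {'A', 'D'}, fds1))  # False
# Exercise 15.9(2): ACD and BC share C, and C -> ACD, so it is lossless.
fds2 = [({'A', 'B'}, {'C'}), ({'C'}, {'A'}), ({'C'}, {'D'})]
print(lossless_binary({'A', 'C', 'D'}, {'B', 'C'}, fds2))  # True
```

Note this test is only valid for decompositions into exactly two relations; multi-way decompositions need the chase or repeated application.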
Exercise 15.10 Suppose that we have the following four tuples in a relation S with three attributes ABC: (1,2,3), (4,2,3), (5,3,3), (5,3,4). Which of the following functional (→) and multivalued (→→) dependencies can you infer does not hold over relation S?
1. A → B
2. A →→ B
3. BC → A
4. BC →→ A
5. B → C
6. B →→ C

Answer 15.10 Answer omitted.
Exercise 15.11 Consider a relation R with five attributes ABCDE.

1. For each of the following instances of R, state whether (a) it violates the FD BC → D, and (b) it violates the MVD BC →→ D:
(a) { } (i.e., empty relation)
(b) {(a,2,3,4,5), (2,a,3,5,5)}
(c) {(a,2,3,4,5), (2,a,3,5,5), (a,2,3,4,6)}
(d) {(a,2,3,4,5), (2,a,3,4,5), (a,2,3,6,5)}
(e) {(a,2,3,4,5), (2,a,3,7,5), (a,2,3,4,6)}
(f) {(a,2,3,4,5), (2,a,3,4,5), (a,2,3,6,5), (a,2,3,6,6)}
(g) {(a,2,3,4,5), (a,2,3,6,5), (a,2,3,6,6), (a,2,3,4,6)}

2. If each instance for R listed above is legal, what can you say about the FD A → B?

Answer 15.11

1. Note: the answer sometimes depends on the value of a. Unless otherwise mentioned, the answer applies to all values of a.
(a) { } (i.e., empty relation): does not violate either dependency.
(b) {(a,2,3,4,5), (2,a,3,5,5)}: BC → D is violated if and only if a = 2. BC →→ D is not violated (for any value of a).
(c) {(a,2,3,4,5), (2,a,3,5,5), (a,2,3,4,6)}: BC → D is violated if a = 2 (otherwise not). If a = 2 then BC →→ D is also violated: consider the tuples (2,a,3,5,5) and (a,2,3,4,6); if a equals 2, the instance must also contain (2,a,3,5,6).
(d) {(a,2,3,4,5), (2,a,3,4,5), (a,2,3,6,5)}: BC → D is violated (consider the first and third tuples, (a,2,3,4,5) and (a,2,3,6,5)). BC →→ D is not violated.
(e) {(a,2,3,4,5), (2,a,3,7,5), (a,2,3,4,6)}: If a = 2 then BC → D is violated (otherwise it is not). If a = 2 then BC →→ D is also violated (otherwise it is not): looking at the last two tuples, there must also be a tuple (2,a,3,7,6) for BC →→ D to hold.
(f) {(a,2,3,4,5), (2,a,3,4,5), (a,2,3,6,5), (a,2,3,6,6)}: BC → D does not hold (consider the first and third tuples). BC →→ D is violated: consider the first and fourth tuples; for the dependency to hold, there should be a tuple (a,2,3,4,6).
(g) {(a,2,3,4,5), (a,2,3,6,5), (a,2,3,6,6), (a,2,3,4,6)}: BC → D does not hold (consider the first and third tuples). BC →→ D is not violated.

2. We cannot say anything about the functional dependency A → B.

Exercise 15.12 JDs are motivated by the fact that sometimes a relation that cannot be decomposed into two smaller relations in a lossless-join manner can be so decomposed into three or more relations. An example is a relation with attributes supplier, part, and project, denoted SPJ, with no FDs or MVDs. The JD {SP, PJ, JS} holds. From the JD, the set of relation schemes SP, PJ, and JS is a lossless-join decomposition of SPJ. Construct an instance of SPJ to illustrate that no two of these schemes suffice.

Answer 15.12 Answer omitted.
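The case analysis in Answer 15.11 can be verified mechanically: an MVD X →→ Y holds on an instance iff, for every pair of tuples agreeing on X, the tuple taking its Y-values from the first and its remaining values from the second is also in the instance. A brute-force sketch, checking instances (g) and (f) with a = 1:

```python
# Brute-force MVD check over a list of tuples with a given attribute order.
def mvd_holds(instance, attrs, X, Y):
    rows = [dict(zip(attrs, t)) for t in instance]
    tuples = set(instance)
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in X):
                # X and Y values from t1, everything else from t2.
                swapped = tuple(t1[a] if a in X or a in Y else t2[a]
                                for a in attrs)
                if swapped not in tuples:
                    return False
    return True

ABCDE = ['A', 'B', 'C', 'D', 'E']
# Instance (g) with a = 1: BC ->> D holds, as claimed above.
g = [(1, 2, 3, 4, 5), (1, 2, 3, 6, 5), (1, 2, 3, 6, 6), (1, 2, 3, 4, 6)]
# Instance (f) with a = 1: BC ->> D fails, since (1,2,3,4,6) is missing.
f = [(1, 2, 3, 4, 5), (2, 1, 3, 4, 5), (1, 2, 3, 6, 5), (1, 2, 3, 6, 6)]
print(mvd_holds(g, ABCDE, ['B', 'C'], ['D']),
      mvd_holds(f, ABCDE, ['B', 'C'], ['D']))  # True False
```

As with FDs, an instance can refute an MVD but can never establish that it holds on the schema.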
Exercise 15.13 Consider a relation R with attributes ABCDE. Let the following FDs be given: A → BC, BC → E, and E → DA. Similarly, let S be a relation with attributes ABCDE and let the following FDs be given: A → BC, B → E, and E → DA. (Only the second dependency differs from those that hold over R.) You do not know whether or which other (join) dependencies hold.

1. Is R in BCNF?
2. Is R in 4NF?
3. Is R in 5NF?
4. Is S in BCNF?
5. Is S in 4NF?
6. Is S in 5NF?

Answer 15.13
1. The schema R has keys A, E, and BC. It follows that R is indeed in BCNF.
2. By Exercise 15.24, Part 1, it follows that R is also in 4NF (since the relation scheme has a single-attribute key).
3. R is in 5NF because the schema does not have any JD besides those that are implied by the FDs of the schema, and these cannot violate the 5NF condition. Note that this alternative argument can be used in some of the other parts of this problem as well.
4. The schema S has keys A, B, and E. It follows that S is indeed in BCNF.
5. By Exercise 15.24, Part 1, it follows that S is also in 4NF (since the relation scheme has a single-attribute key).
6. By Exercise 15.24, Part 2, it follows that S is also in 5NF (since each key is a single-attribute key).

Exercise 15.14 Let us say that an FD X → Y is simple if Y is a single attribute.

1. Replace the FD AB → CD by the smallest equivalent collection of simple FDs.
2. Prove that every FD X → Y in a set of FDs F can be replaced by a set of simple FDs such that F+ is equal to the closure of the new set of FDs.

Answer 15.14 Answer omitted.
Exercise 15.15 Prove that Armstrong’s Axioms are sound and complete for FD inference. That is, show that repeated application of these axioms on a set F of FDs produces exactly the dependencies in F + . Answer 15.15 Proof omitted.
Exercise 15.16 Describe a linear-time (in the size of the set of FDs, where the size of each FD is the number of attributes involved) algorithm for ﬁnding the attribute closure of a set of attributes with respect to a set of FDs. Answer 15.16 Answer omitted.
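Although Answer 15.16 is omitted, the standard linear-time closure algorithm can be sketched: keep, for each FD, a count of left-hand-side attributes not yet absorbed into the closure, and fire the FD as soon as its count reaches zero. (The function and variable names below are our own.)

```python
# Linear-time attribute closure: each attribute enters the work queue at
# most once, and each FD's counter is decremented at most |lhs| times.
from collections import deque, defaultdict

def attribute_closure(X, fds):
    # fds: list of (lhs, rhs) pairs of attribute sets.
    count = [len(lhs) for lhs, _ in fds]
    uses = defaultdict(list)          # attribute -> indices of FDs using it
    for i, (lhs, _) in enumerate(fds):
        for a in lhs:
            uses[a].append(i)
    closure, queue = set(X), deque(X)
    while queue:
        a = queue.popleft()
        for i in uses[a]:
            count[i] -= 1
            if count[i] == 0:         # whole lhs of fds[i] is now in closure
                for b in fds[i][1] - closure:
                    closure.add(b)
                    queue.append(b)
    return closure

# FDs of Exercise 15.13's relation R: A -> BC, BC -> E, E -> DA.
fds = [({'A'}, {'B', 'C'}), ({'B', 'C'}, {'E'}), ({'E'}, {'D', 'A'})]
print(sorted(attribute_closure({'A'}, fds)))  # ['A', 'B', 'C', 'D', 'E']
```

Here A+ covers all of ABCDE, confirming that A is a key of R as used in Answer 15.13.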
Exercise 15.17 Consider a scheme R with FDs F that is decomposed into schemes with attributes X and Y. Show that this is dependency-preserving if F ⊆ (FX ∪ FY)+.

Answer 15.17 We need to show that F+ = (FX ∪ FY)+. Both containments are based on two observations: (1) if A ⊆ B are two sets of FDs, then A+ ⊆ B+, and (2) A++ = A+. The inclusion (FX ∪ FY)+ ⊆ F+ follows from observing that, by definition, FX ⊆ F+ and FY ⊆ F+, so that FX ∪ FY ⊆ F+ (now apply observations 1 and 2). The other containment, F+ ⊆ (FX ∪ FY)+, follows from the hypothesis F ⊆ (FX ∪ FY)+ and observations 1 and 2.
Exercise 15.18 Let R be a relation schema with a set F of FDs. Prove that the decomposition of R into R1 and R2 is lossless-join if and only if F + contains R1 ∩ R2 → R1 or R1 ∩ R2 → R2 . Answer 15.18 Answer omitted.
Exercise 15.19 Prove that the optimization of the algorithm for lossless-join, dependency-preserving decomposition into 3NF relations (Section 15.7.2) is correct.

Answer 15.19 Answer omitted.
Exercise 15.20 Prove that the 3NF synthesis algorithm produces a lossless-join decomposition of the relation containing all the original attributes.

Answer 15.20 Answer omitted.

Exercise 15.21 Prove that an MVD X →→ Y over a relation R can be expressed as the join dependency {XY, X(R − Y)}.

Answer 15.21 Write Z = R − Y; thus R = YXZ. X →→ Y says that if (y1, x, z1), (y2, x, z2) ∈ R, then (y1, x, z2), (y2, x, z1) ∈ R as well. But this is precisely the same as saying that R satisfies the join dependency {XY, X(R − Y)}, i.e., R = πXY(R) ⋈ πXZ(R).

Exercise 15.22 Prove that if R has only one key, it is in BCNF if and only if it is in 3NF.
Answer 15.22 Answer omitted.
Exercise 15.23 Prove that if R is in 3NF and every key is simple, then R is in BCNF.

Answer 15.23 Answer omitted.

Exercise 15.24 Prove these statements:

1. If a relation scheme is in BCNF and at least one of its keys consists of a single attribute, it is also in 4NF.
2. If a relation scheme is in 3NF and each key has a single attribute, it is also in 5NF.

Answer 15.24 Answer omitted.

Exercise 15.25 Give an algorithm for testing whether a relation scheme is in BCNF. The algorithm should be polynomial in the size of the set of given FDs. (The size is the sum over all FDs of the number of attributes that appear in the FD.) Is there a polynomial algorithm for testing whether a relation scheme is in 3NF?

Answer 15.25 It suffices to check the given FDs: R is in BCNF if and only if every given nontrivial FD X → Y (i.e., one where Y ⊆ X does not hold) has a superkey as its left-hand side. So, for each FD X → Y in F, compute the attribute closure X+ with respect to F (see Exercise 15.16), and verify that either Y ⊆ X or X+ contains all the attributes of R. This is polynomial in the size of F. No polynomial algorithm for testing 3NF is known: deciding whether an attribute is prime (i.e., belongs to some candidate key) is NP-complete, and testing whether a scheme is in 3NF is NP-complete as well.

Exercise 15.26 Give an algorithm for testing whether a relation scheme is in BCNF. The algorithm should be polynomial in the size of the set of given FDs. (The 'size' is the sum over all FDs of the number of attributes that appear in the FD.) Is there a polynomial algorithm for testing whether a relation scheme is in 3NF?

Answer 15.26 Answer omitted.
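Exercise 15.25's polynomial BCNF test amounts to checking, for each given nontrivial FD, whether its left-hand side is a superkey. A sketch using a simple (quadratic, still polynomial) closure routine:

```python
# BCNF test: every given nontrivial FD must have a superkey left-hand side.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue                  # trivial FD, never a violation
        if closure(lhs, fds) != schema:
            return False              # nontrivial FD with a non-superkey lhs
    return True

# Exercise 15.2's relation: ABCDE with A -> B, BC -> E, ED -> A (not BCNF:
# A+ = AB, so A -> B already violates the condition).
fds = [({'A'}, {'B'}), ({'B', 'C'}, {'E'}), ({'E', 'D'}, {'A'})]
print(is_bcnf({'A', 'B', 'C', 'D', 'E'}, fds))  # False
```

Checking only the given FDs (rather than all of F+) is sufficient for BCNF, which is exactly what keeps the test polynomial.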
16
PHYSICAL DATABASE DESIGN AND TUNING
Exercise 16.1 Consider the following relations:

Emp(eid: integer, ename: varchar, sal: integer, age: integer, did: integer)
Dept(did: integer, budget: integer, floor: integer, mgr_eid: integer)

Salaries range from $10,000 to $100,000, ages vary from 20 to 80, each department has about five employees on average, there are 10 floors, and budgets vary from $10,000 to $1,000,000. You can assume uniform distributions of values.

For each of the following queries, which of the listed index choices would you choose to speed up the query? If your database system does not consider index-only plans (i.e., data records are always retrieved even if enough information is available in the index entry), how would your answer change? Explain briefly.

1. Query: Print ename, age, and sal for all employees.
(a) Clustered, dense hash index on ename, age, sal fields of Emp.
(b) Unclustered hash index on ename, age, sal fields of Emp.
(c) Clustered, sparse B+ tree index on ename, age, sal fields of Emp.
(d) Unclustered hash index on eid, did fields of Emp.
(e) No index.

2. Query: Find the dids of departments that are on the 10th floor and that have a budget of less than $15,000.
(a) Clustered, dense hash index on the floor field of Dept.
(b) Unclustered hash index on the floor field of Dept.
(c) Clustered, dense B+ tree index on floor, budget fields of Dept.
(d) Clustered, sparse B+ tree index on the budget field of Dept.
(e) No index.
3. Query: Find the names of employees who manage some department and have a salary greater than $12,000.
(a) Clustered, sparse B+ tree index on the sal field of Emp.
(b) Clustered hash index on the did field of Dept.
(c) Unclustered hash index on the did field of Dept.
(d) Unclustered hash index on the eid field of Emp.
(e) Clustered B+ tree index on the sal field of Emp and clustered hash index on the did field of Dept.

4. Query: Print the average salary for each department.
(a) Clustered, sparse B+ tree index on the did field of Emp.
(b) Clustered, dense B+ tree index on the did field of Emp.
(c) Clustered, dense B+ tree index on did, sal fields of Emp.
(d) Unclustered hash index on did, sal fields of Emp.
(e) Clustered, dense B+ tree index on the did field of Dept.

Answer 16.1

1. We should create an unclustered hash index on the ename, age, sal fields of Emp (b), since we could then do an index-only scan. If our system does not consider index-only plans, we should not create an index for this query (e): the query requires us to access all the Emp records, so an index does not help, and we should access the records using a file scan.

2. We should create a clustered, dense B+ tree index (c) on the floor, budget fields of Dept, since the records are then ordered on these fields. When executing this query, the first record with floor = 10 is retrieved, and the remaining records with floor = 10 can be read in order. Note that this plan, the best for this query, is not an index-only plan.

3. We should create an unclustered hash index on the eid field of Emp (d), since we can then do a file scan on Dept and use each manager id (mgr_eid) to probe Emp, checking whether the salary is greater than $12,000. None of the indexes offered lend themselves to index-only scans, so it does not matter whether such plans are allowed.

4. For this query we should create a dense, clustered B+ tree index (c) on the did, sal fields of the Emp relation, so we can do an index-only scan.
(An unclustered index would be sufficient, but it is not included in the list of index choices for this question.) If index-only scans are not allowed, we should instead create a clustered, sparse B+ tree index on the did field of Emp (a). This index is just as efficient as an index on did, sal, since both involve accessing the data records in did order; but with only one attribute in the search key, it will be updated less often.
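Whether a system actually chooses an index-only plan, as discussed in Answer 16.1(1), is easy to observe in practice. SQLite (used here purely as an illustrative DBMS) reports such plans as scans of a "COVERING INDEX":

```python
# Build choice (b) from query 1 and ask SQLite how it would answer the
# query; a covering-index scan means no data records are touched.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Emp (eid INTEGER, ename TEXT, sal INTEGER, '
             'age INTEGER, did INTEGER)')
conn.execute('CREATE INDEX emp_nas ON Emp (ename, age, sal)')

plan = conn.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT ename, age, sal FROM Emp ORDER BY ename').fetchall()
detail = ' '.join(str(row) for row in plan)
print('COVERING INDEX' in detail)
```

SQLite uses B+-tree-style indexes rather than hash indexes, so this only illustrates the index-only aspect of the answer, not the choice of index structure.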
Exercise 16.2 Consider the following relation:

Emp(eid: integer, sal: integer, age: real, did: integer)

There is a clustered index on eid and an unclustered index on age.
1. Which factors would you consider in deciding whether to make an index on a relation a clustered index? Would you always create at least one clustered index on every relation?
2. How would you use the indexes to enforce the constraint that eid is a key?
3. Give an example of an update that is definitely speeded up because of the available indexes. (English description is sufficient.)
4. Give an example of an update that is definitely slowed down because of the indexes. (English description is sufficient.)
5. Can you give an example of an update that is neither speeded up nor slowed down by the indexes?

Answer 16.2 Answer omitted.
Exercise 16.3 Consider the following BCNF schema for a portion of a simple corporate database (type information is not relevant to this question and is omitted):
Emp (eid, ename, addr, sal, age, yrs, deptid)
Dept (did, dname, floor, budget)

Suppose you know that the following queries are the six most common queries in the workload for this corporation and that all six are roughly equivalent in frequency and importance:

List the id, name, and address of employees in a user-specified age range.
List the id, name, and address of employees who work in the department with a user-specified department name.
List the id and address of employees with a user-specified employee name.
List the overall average salary for employees.
List the average salary for employees of each age; that is, for each age in the database, list the age and the corresponding average salary.
List all the department information, ordered by department floor numbers.
1. Given this information, and assuming that these queries are more important than any updates, design a physical schema for the corporate database that will give good performance for the expected workload. In particular, decide which attributes will be indexed and whether each index will be a clustered index or an unclustered index. Assume that B+ tree indexes are the only index type supported by the DBMS and that both single- and multiple-attribute keys are permitted. Specify your physical design by identifying the attributes that you recommend indexing on via clustered or unclustered B+ trees.

2. Redesign the physical schema assuming that the set of important queries is changed to be the following:

List the id and address of employees with a user-specified employee name.
List the overall maximum salary for employees.
List the average salary for employees by department; that is, for each deptid value, list the deptid value and the average salary of employees in that department.
List the sum of the budgets of all departments by floor; that is, for each floor, list the floor and the sum.

Answer 16.3

1. If we create a dense unclustered B+ tree index on the age, sal fields of the Emp relation, we will be able to do an index-only scan to answer the fifth query. A hash index would not serve our purpose here, since the data entries would not be ordered by age! If index-only scans are not allowed, create a clustered B+ tree index on just the age field of Emp.

We should create an unclustered B+ tree index on deptid of the Emp relation and another unclustered index on the dname, did fields of the Dept relation. Then, for the second query, we can do an index-only search on Dept and retrieve the Emp records with the proper deptids.

We should create an unclustered index on ename of the Emp relation for the third query.

We want a clustered sparse B+ tree index on floor of the Dept relation, so for the sixth query we can retrieve the departments on each floor in floor order.
Finally, a dense unclustered index on sal would allow us to average the salaries of all employees using an index-only scan. However, the dense unclustered B+ tree index on age, sal that we created to support query (5) can also be used to compute the average salary of all employees, and is almost as good for this query as an index on just sal. So we should not create a separate index on just sal. If index-only scans are not allowed, no index is useful in finding the average salary of all employees; we would simply use a file scan.
2.
We should create an unclustered B+ tree index on ename for the Emp relation so we can efficiently find employees with a particular name for the first query. This is not an index-only plan.

An unclustered B+ tree index on sal for the Emp relation will help find the maximum salary for the second query. (This is better than a hash index because the aggregate operation involved is MAX; we can simply go down to the rightmost leaf page in the B+ tree index.) This is not an index-only plan.

We should create a dense unclustered B+ tree index on the deptid, sal fields of the Emp relation so we can do an index-only scan over each department's employees. If index-only plans are not supported, a sparse, clustered B+ tree index on deptid would be best; it would allow us to retrieve tuples by deptid.

We should create a dense, unclustered index on the floor, budget fields of Dept. This would allow us to sum budgets by floor using an index-only plan. If index-only plans are not supported, we should create a sparse clustered B+ tree index on floor for the Dept relation, so we can find the departments on each floor in order by floor.
Exercise 16.4 Consider the following BCNF relational schema for a portion of a university database (type information is not relevant to this question and is omitted):
Prof(ssno, pname, office, age, sex, specialty, dept did)
Dept(did, dname, budget, num majors, chair ssno)

Suppose you know that the following queries are the five most common queries in the workload for this university and that all five are roughly equivalent in frequency and importance:

List the names, ages, and offices of professors of a user-specified sex (male or female) who have a user-specified research specialty (e.g., recursive query processing). Assume that the university has a diverse set of faculty members, making it very uncommon for more than a few professors to have the same research specialty.
List all the department information for departments with professors in a user-specified age range.
List the department id, department name, and chairperson name for departments with a user-specified number of majors.
List the lowest budget for a department in the university.
List all the information about professors who are department chairpersons.
These queries occur much more frequently than updates, so you should build whatever indexes you need to speed up these queries. However, you should not build any unnecessary indexes, as updates will occur (and would be slowed down by unnecessary indexes). Given this information, design a physical schema for the university database that will give good performance for the expected workload. In particular, decide which attributes should be indexed and whether each index should be a clustered index or an unclustered index. Assume that both B+ trees and hashed indexes are supported by the DBMS and that both single- and multiple-attribute index search keys are permitted.

1. Specify your physical design by identifying the attributes that you recommend indexing on, indicating whether each index should be clustered or unclustered and whether it should be a B+ tree or a hashed index.
2. Redesign the physical schema assuming that the set of important queries is changed to be the following:

List the number of different specialties covered by professors in each department, by department.
Find the department with the fewest majors.
Find the youngest professor who is a department chairperson.

Answer 16.4 Answer omitted.

Exercise 16.5 Consider the following BCNF relational schema for a portion of a company database (type information is not relevant to this question and is omitted):

Project(pno, proj name, proj base dept, proj mgr, topic, budget)
Manager(mid, mgr name, mgr dept, salary, age, sex)

Note that each project is based in some department, each manager is employed in some department, and the manager of a project need not be employed in the same department (in which the project is based).
Suppose you know that the following queries are the five most common queries in the workload for this company and that all five are roughly equivalent in frequency and importance:

List the names, ages, and salaries of managers of a user-specified sex (male or female) working in a given department. You can assume that while there are many departments, each department contains very few project managers.
List the names of all projects with managers whose ages are in a user-specified range (e.g., younger than 30).
List the names of all departments such that a manager in this department manages a project based in this department.
List the name of the project with the lowest budget.
List the names of all managers in the same department as a given project.
These queries occur much more frequently than updates, so you should build whatever indexes you need to speed up these queries. However, you should not build any unnecessary indexes, as updates will occur (and would be slowed down by unnecessary indexes). Given this information, design a physical schema for the company database that will give good performance for the expected workload. In particular, decide which attributes should be indexed and whether each index should be a clustered index or an unclustered index. Assume that both B+ trees and hashed indexes are supported by the DBMS, and that both single- and multiple-attribute index keys are permitted.
1. Specify your physical design by identifying the attributes that you recommend indexing on, indicating whether each index should be clustered or unclustered and whether it should be a B+ tree or a hashed index.
2. Redesign the physical schema assuming that the set of important queries is changed to be the following:

Find the total of the budgets for projects managed by each manager; that is, list proj mgr and the total of the budgets of projects managed by that manager, for all values of proj mgr.
Find the total of the budgets for projects managed by each manager but only for managers who are in a user-specified age range.
Find the number of male managers.
Find the average age of managers.

Answer 16.5

1. For the first query, we should create a dense unclustered hash index on mgr dept for the Manager relation. We omit sex from the key in this index since it is not very selective; however, including it is probably not very expensive, since this field is unlikely to be updated.

We should create an unclustered B+ tree index on the age, mgr dept, mid fields of the Manager relation, and an unclustered hash index on the proj base dept, proj mgr fields of the Project relation. We can do an index-only scan to find managers whose age is in the specified range, and then hash into the Project relation to get the project names. If index-only scans are not supported, the index on Manager should be a clustered index on age.

For the third query we do not need a new index. We can scan all managers and use the hash index on proj base dept, proj mgr on the Project relation to check whether mgr dept = proj base dept.

For the fourth query, we can create an unclustered B+ tree index on budget in the Project relation and go down the tree to find the lowest budget.
For the fifth query, we should create a dense unclustered hash index on pno for the Project relation. We can get the proj base dept of the project by using this index, and then use the hash index on mgr dept to get the managers in this department. Note that an index on pno, proj base dept for Project would allow us to do an index-only scan on Project. However, since there is exactly one base department for each project (pno is the key), this is not likely to be significantly faster. (It does save us one I/O per project.)
2.
For the first query, we should create an unclustered B+ tree index on the proj mgr, budget fields of the Project relation. An index-only scan can then be used to solve the query. If index-only scans are not supported, a clustered index on proj mgr would be best.

If we create a sparse clustered B+ tree index on the age, mid fields of Manager, we can do an index-only scan on this index to find the ids of managers in the given range. Then, we can use an index-only scan of the B+ tree index on proj mgr, budget to compute the total of the budgets of the projects that each of these managers manages. If index-only scans are not supported, the index on Manager should be a clustered B+ tree index on age.

An unclustered hash index on sex will divide the managers by sex and allow us to count the number that are male using an index-only scan. If index-only scans are not allowed, then no index will help us for the third query.

We should create an unclustered hash index on age for the fourth query. All we need to do is average the ages using an index-only scan. If index-only plans are not allowed, no index will help us.
Exercise 16.6 The Globetrotters Club is organized into chapters. The president of a chapter can never serve as the president of any other chapter, and each chapter gives its president some salary. Chapters keep moving to new locations, and a new president is elected when (and only when) a chapter moves. The above data is stored in a relation G(C,S,L,P), where the attributes are chapters (C), salaries (S), locations (L), and presidents (P). Queries of the following form are frequently asked, and you must be able to answer them without computing a join: "Who was the president of chapter X when it was in location Y?"

1. List the FDs that are given to hold over G.
2. What are the candidate keys for relation G?
3. What normal form is the schema G in?
4. Design a good database schema for the club. (Remember that your design must satisfy the query requirement stated above!)
5. What normal form is your good schema in? Give an example of a query that is likely to run slower on this schema than on the relation G.
6. Is there a lossless-join, dependency-preserving decomposition of G into BCNF?
7. Is there ever a good reason to accept something less than 3NF when designing a schema for a relational database? Use this example, if necessary adding further constraints, to illustrate your answer.

Answer 16.6 Answer omitted.

Exercise 16.7 Consider the following BCNF relation, which lists the ids, types (e.g., nuts or bolts), and costs of various parts, along with the number that are available or in stock:

Parts (pid, pname, cost, num avail)

You are told that the following two queries are extremely important:

Find the total number available by part type, for all types. (That is, the sum of the num avail value of all nuts, the sum of the num avail value of all bolts, etc.)
List the pids of parts with the highest cost.

1. Describe the physical design that you would choose for this relation. That is, what kind of a file structure would you choose for the set of Parts records, and what indexes would you create?
2. Suppose that your customers subsequently complain that performance is still not satisfactory (given the indexes and file organization that you chose for the Parts relation in response to the previous question). Since you cannot afford to buy new hardware or software, you have to consider a schema redesign. Explain how you would try to obtain better performance by describing the schema for the relation(s) that you would use and your choice of file organizations and indexes on these relations.
3. How would your answers to the above two questions change, if at all, if your system did not support indexes with multiple-attribute search keys?

Answer 16.7

1. A heap file structure could be used for the relation Parts. A dense unclustered B+ tree index on (pname, num avail) and a dense unclustered B+ tree index on (cost, pid) can be created to answer the queries efficiently.
2.
The problem could be that the optimizer is not considering the index-only plans available under the schema just described. Instead, we can create clustered indexes on (pid, cost) and (pname, num avail). To do this, we vertically partition the relation into two relations: Parts1(pid, cost) and Parts2(pid, pname, num avail). (If the indexes themselves have not been implemented properly, we can instead use sorted file organizations for these two split relations.)
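The vertical partitioning proposed above can be sketched in SQLite (data invented; SQLite is used only to illustrate the schema split, not clustering). Each fragment carries the key pid, so the decomposition is lossless:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Parts (pid INTEGER PRIMARY KEY, pname TEXT, "
             "cost INTEGER, num_avail INTEGER)")
rows = [(1, "nut", 10, 100), (2, "bolt", 20, 50), (3, "nut", 5, 30)]
conn.executemany("INSERT INTO Parts VALUES (?,?,?,?)", rows)

# Vertical partition: each fragment keeps the key pid.
conn.execute("CREATE TABLE Parts1 AS SELECT pid, cost FROM Parts")
conn.execute("CREATE TABLE Parts2 AS SELECT pid, pname, num_avail FROM Parts")

# Query 1 (total available by part type) now touches only Parts2...
totals = conn.execute(
    "SELECT pname, SUM(num_avail) FROM Parts2 GROUP BY pname").fetchall()

# ...and the original relation is recoverable by a join on pid.
rebuilt = conn.execute(
    "SELECT p1.pid, p2.pname, p1.cost, p2.num_avail "
    "FROM Parts1 p1 JOIN Parts2 p2 ON p1.pid = p2.pid ORDER BY p1.pid").fetchall()
```

Each of the two narrow tables can now be stored sorted (or clustered) on the attributes its query needs, which a single wide table cannot do for both queries at once.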
3. If multiple-attribute search keys are not allowed, we can create clustered B+ tree indexes on cost and on pname for the two relations.
Exercise 16.8 Consider the following BCNF relations, which describe employees and the departments they work in:

Emp (eid, sal, did)
Dept (did, location, budget)

You are told that the following queries are extremely important:

Find the location where a user-specified employee works.
Check whether the budget of a department is greater than the salary of each employee in that department.

1. Describe the physical design that you would choose for these relations. That is, what kind of a file structure would you choose for these relations, and what indexes would you create?
2. Suppose that your customers subsequently complain that performance is still not satisfactory (given the indexes and file organization that you chose for the relations in response to the previous question). Since you cannot afford to buy new hardware or software, you have to consider a schema redesign. Explain how you would try to obtain better performance by describing the schema for the relation(s) that you would use and your choice of file organizations and indexes on these relations.
3. Suppose that your database system has very inefficient implementations of index structures. What kind of a design would you try in this case?

Answer 16.8 Answer omitted.
Exercise 16.9 Consider the following BCNF relations, which describe departments in a company and employees:

Dept(did, dname, location, managerid)
Emp(eid, sal)

You are told that the following queries are extremely important:

List the names and ids of managers for each department in a user-specified location, in alphabetical order by department name.
Find the average salary of employees who manage departments in a user-speciﬁed location. You can assume that no one manages more than one department. 1. Describe the ﬁle structures and indexes that you would choose. 2. You subsequently realize that updates to these relations are frequent. Because indexes incur a high overhead, can you think of a way to improve performance on these queries without using indexes? Answer 16.9 1. We can go in for heap ﬁle organisation for the 2 relations and have the following indexes. A clustered index on ( location, dname ) ( P.S. We cannot list the names of the persons as there is no name attribute present ) would be useful. We can also have a hash index on eid on the Emp relation to speed up the second query. 2. Yes, we can go in for a horizontal decompostion of the Dept relation based on the location. We can also go in for sorted ﬁle organisations, the relations Dept sorted on dname and Emp on eid .
Exercise 16.10 For each of the following queries, identify one possible reason why an optimizer might not find a good plan. Rewrite the query so that a good plan is likely to be found. Any available indexes or known constraints are listed before each query; assume that the relation schemas are consistent with the attributes referred to in the query.

1. An index is available on the age attribute.

SELECT E.dno
FROM Employee E
WHERE E.age=20 OR E.age=10

2. A B+ tree index is available on the age attribute.

SELECT E.dno
FROM Employee E
WHERE E.age<20 AND E.age>10

3. An index is available on the age attribute.

SELECT E.dno
FROM Employee E
WHERE 2*E.age<20

4. No indexes are available.
SELECT DISTINCT *
FROM Employee E

5. No indexes are available.

SELECT AVG (E.sal)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=22
6. sid in Reserves is a foreign key that refers to Sailors.

SELECT S.sid
FROM Sailors S, Reserves R
WHERE S.sid=R.sid
Answer 16.10 Answer omitted.
Exercise 16.11 Consider the following two ways of computing the names of employees who earn more than $100,000 and whose age is equal to their manager's age. First, a nested query:

SELECT E1.ename
FROM Emp E1
WHERE E1.sal > 100 AND E1.age = ( SELECT E2.age
                                  FROM Emp E2, Dept D2
                                  WHERE E1.dname = D2.dname
                                  AND D2.mgr = E2.ename )
Second, a query that uses a view definition:

SELECT E1.ename
FROM Emp E1, MgrAge A
WHERE E1.dname = A.dname AND E1.sal > 100 AND E1.age = A.age
CREATE VIEW MgrAge (dname, age)
AS SELECT D.dname, E.age
FROM Emp E, Dept D
WHERE D.mgr = E.ename

1. Describe a situation in which the first query is likely to outperform the second query.
2. Describe a situation in which the second query is likely to outperform the first query.
3. Can you construct an equivalent query that is likely to beat both these queries when every employee who earns more than $100,000 is either 35 or 40 years old? Explain briefly.

Answer 16.11

1. Consider the case when there are very few or no employees having salary more than $100K. Then in the first query the nested part would not be computed (due to short-circuit evaluation), whereas in the second query the join of Emp and MgrAge would be computed irrespective of the number of employees with sal > 100K.
2. In the case when there are a large number of employees with sal > 100K and the Dept relation is large, in the first query the join of Dept and Emp would be computed for each tuple in Emp that satisfies the condition E1.sal > 100K, whereas in the second query the join is computed only once.
3. In this case the selectivity of age may be very high. So if we have a B+ tree index on, say, (age, sal), then the following query may perform better.

SELECT E1.ename
FROM Emp E1
WHERE E1.age = 35 AND E1.sal > 100
AND E1.age = ( SELECT E2.age
               FROM Emp E2, Dept D2
               WHERE E1.dname = D2.dname AND D2.mgr = E2.ename )
UNION
SELECT E1.ename
FROM Emp E1
WHERE E1.age = 40 AND E1.sal > 100
AND E1.age = ( SELECT E2.age
               FROM Emp E2, Dept D2
               WHERE E1.dname = D2.dname AND D2.mgr = E2.ename )
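The equivalence of the nested query, the view-based query, and the UNION rewrite can be checked on a toy instance. A sketch in SQLite (tables and data invented; one manager per department, as the schema implies):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (ename TEXT, dname TEXT, age INTEGER, sal INTEGER)")
conn.execute("CREATE TABLE Dept (dname TEXT, mgr TEXT)")
conn.executemany("INSERT INTO Emp VALUES (?,?,?,?)", [
    ("Alice", "toys", 35, 150), ("Bob", "toys", 35, 90),
    ("Carl", "books", 40, 120), ("Mona", "toys", 35, 200),
    ("Dave", "books", 50, 80)])
conn.executemany("INSERT INTO Dept VALUES (?,?)",
                 [("toys", "Mona"), ("books", "Dave")])
conn.execute("""CREATE VIEW MgrAge AS
                SELECT D.dname AS dname, E.age AS age
                FROM Emp E, Dept D WHERE D.mgr = E.ename""")

sub = """(SELECT E2.age FROM Emp E2, Dept D2
          WHERE E1.dname = D2.dname AND D2.mgr = E2.ename)"""
nested = f"SELECT E1.ename FROM Emp E1 WHERE E1.sal > 100 AND E1.age = {sub}"
viewq = """SELECT E1.ename FROM Emp E1, MgrAge A
           WHERE E1.dname = A.dname AND E1.sal > 100 AND E1.age = A.age"""
union = (f"SELECT E1.ename FROM Emp E1 WHERE E1.age = 35 AND E1.sal > 100 "
         f"AND E1.age = {sub} UNION "
         f"SELECT E1.ename FROM Emp E1 WHERE E1.age = 40 AND E1.sal > 100 "
         f"AND E1.age = {sub}")

# All three formulations return the same set of employees.
r1, r2, r3 = (sorted(conn.execute(q).fetchall()) for q in (nested, viewq, union))
```

The point of the rewrite is not the result but the plan: the added equality predicates on age give the optimizer a selective condition it can push into an index on (age, sal).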
17 SECURITY
Exercise 17.1 Briefly answer the following questions based on this schema:

Emp(eid: integer, ename: string, age: integer, salary: real)
Works(eid: integer, did: integer, pct time: integer)
Dept(did: integer, budget: real, managerid: integer)

1. Suppose you have a view SeniorEmp defined as follows:

CREATE VIEW SeniorEmp (sname, sage, salary)
AS SELECT E.ename, E.age, E.salary
FROM Emp E
WHERE E.age > 50

Explain what the system will do to process the following query:

SELECT S.sname
FROM SeniorEmp S
WHERE S.salary > 100,000

2. Give an example of a view on Emp that could be automatically updated by updating Emp.
3. Give an example of a view on Emp that would be impossible to update (automatically) and explain why your example presents the update problem that it does.
4. Consider the following view definition:

CREATE VIEW DInfo (did, manager, numemps, totsals)
AS SELECT D.did, D.managerid, COUNT (*), SUM (E.salary)
FROM Emp E, Works W, Dept D
WHERE E.eid = W.eid AND W.did = D.did
GROUP BY D.did, D.managerid
(a) Give an example of a view update on DInfo that could (in principle) be implemented automatically by updating one or more of the relations Emp, Works, and Dept. Does SQL-92 allow such a view update?
(b) Give an example of a view update on DInfo that cannot (even in principle) be implemented automatically by updating one or more of the relations Emp, Works, and Dept. Explain why.
(c) How could the view DInfo help in enforcing security?

Answer 17.1 The answers are given below:

1. The system will do the following:

SELECT S.name
FROM ( SELECT E.ename AS name, E.age, E.salary
       FROM Emp E
       WHERE E.age > 50 ) AS S
WHERE S.salary > 100000
2. The following view on Emp can be updated automatically by updating Emp:

CREATE VIEW SeniorEmp (eid, name, age, salary)
AS SELECT E.eid, E.ename, E.age, E.salary
FROM Emp E
WHERE E.age > 50

3. The following view cannot be updated automatically because it is not clear which employee records will be affected by a given update:

CREATE VIEW AvgSalaryByAge (age, avgSalary)
AS SELECT E.age, AVG (E.salary)
FROM Emp E
GROUP BY E.age
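The contrast between the two views can be seen concretely in SQLite, where any view is read-only until an INSTEAD OF trigger spells out how an update maps to the base table; for a selection view like SeniorEmp the mapping is obvious, while for an aggregate view no sensible mapping exists. (SQLite's trigger mechanism differs from SQL-92's updatable-view rules; this sketch is illustrative only.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT, "
             "age INTEGER, salary REAL)")
conn.execute("INSERT INTO Emp VALUES (1, 'Ann', 55, 90000)")
conn.execute("CREATE VIEW SeniorEmp AS "
             "SELECT eid, ename, age, salary FROM Emp WHERE age > 50")

# Without a trigger, even this simple selection view is read-only in SQLite.
blocked = False
try:
    conn.execute("UPDATE SeniorEmp SET salary = 95000 WHERE eid = 1")
except sqlite3.OperationalError:
    blocked = True  # "cannot modify SeniorEmp because it is a view"

# An INSTEAD OF trigger supplies the obvious translation for this view.
conn.execute("""
    CREATE TRIGGER senior_upd INSTEAD OF UPDATE ON SeniorEmp
    BEGIN
        UPDATE Emp SET salary = NEW.salary WHERE eid = NEW.eid;
    END""")
conn.execute("UPDATE SeniorEmp SET salary = 95000 WHERE eid = 1")
sal = conn.execute("SELECT salary FROM Emp WHERE eid = 1").fetchone()[0]
```

No analogous trigger can be written for AvgSalaryByAge: a change to an average does not determine which underlying salaries should change, or by how much.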
4. (a) If DInfo.manager is updated, the change could, in principle, be implemented automatically by updating the Dept relation to reflect a change in the manager of department DInfo.did. However, since SQL-92 does not allow an update on a view defined over more than one base relation, this view update is not allowed.
(b) If DInfo.totsals is updated, the change cannot be implemented automatically at all, because it is not clear which of the employees' salary fields need to be changed.
(c) Views are an important component of the security mechanisms provided by a relational DBMS. By defining views on the base relations, we can present needed information to a user while hiding other information that the user should not be given access to. For example, the chairman of a company might want his secretary to be able to look at the total salaries paid to each department under him, but not at the individual salaries of the employees working in those departments. This view definition would be useful in that case, and it provides a layer of security that prevents the secretary from viewing or changing the salaries of the employees.
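The chairman/secretary scenario can be sketched as follows (SQLite, invented data; SQLite has no GRANT, so the privilege step is noted in a comment). The view exposes only departmental totals, so individual salaries never appear in its output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (eid INTEGER, ename TEXT, salary REAL, did INTEGER)")
conn.executemany("INSERT INTO Emp VALUES (?,?,?,?)",
                 [(1, "Ann", 90000, 10), (2, "Bob", 60000, 10),
                  (3, "Carol", 70000, 20)])

# The view aggregates away the sensitive column. In a full DBMS the
# secretary would be granted SELECT on this view only, with no
# privileges at all on the underlying Emp relation.
conn.execute("""CREATE VIEW DeptSalaries AS
                SELECT did, SUM(salary) AS totsal
                FROM Emp GROUP BY did""")

totals = dict(conn.execute("SELECT did, totsal FROM DeptSalaries"))
```

A caveat from the statistical-database discussion in Exercise 17.3 applies here too: if a department has only one employee, the departmental total reveals that individual's salary, so aggregation alone is not always sufficient protection.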
Exercise 17.2 You are the DBA for the VeryFine Toy Company, and you create a relation called Employees with fields ename, dept, and salary. For authorization reasons, you also define views EmployeeNames (with ename as the only attribute) and DeptInfo with fields dept and avgsalary. The latter lists the average salary for each department.

1. Show the view definition statements for EmployeeNames and DeptInfo.
2. What privileges should be granted to a user who needs to know only average department salaries for the Toy and CS departments?
3. You want to authorize your secretary to fire people (you'll probably tell him whom to fire, but you want to be able to delegate this task), to check on who is an employee, and to check on average department salaries. What privileges should you grant?
4. Continuing with the preceding scenario, you don't want your secretary to be able to look at the salaries of individuals. Does your answer to the previous question ensure this? Be specific: Can your secretary possibly find out salaries of some individuals (depending on the actual set of tuples), or can your secretary always find out the salary of any individual that he wants to?
5. You want to give your secretary the authority to allow other people to read the EmployeeNames view. Show the appropriate command.
6. Your secretary defines two new views using the EmployeeNames view. The first is called AtoRNames and simply selects names that begin with a letter in the range A to R. The second is called HowManyNames and counts the number of names. You are so pleased with this achievement that you decide to give your secretary the right to insert tuples into the EmployeeNames view. Show the appropriate command, and describe what privileges your secretary has after this command is executed.
7. Your secretary allows Todd to read the EmployeeNames relation and later quits. You then revoke the secretary's privileges. What happens to Todd's privileges?
8. Give an example of a view update on the above schema that cannot be implemented through updates to Employees.
9. You decide to go on an extended vacation, and to make sure that emergencies can be handled, you want to authorize your boss Joe to read and modify the Employees relation and the EmployeeNames relation (and Joe must be able to delegate authority, of course, since he's too far up the management hierarchy to actually do any work). Show the appropriate SQL statements. Can Joe read the DeptInfo view?
10. After returning from your (wonderful) vacation, you see a note from Joe, indicating that he authorized his secretary Mike to read the Employees relation. You want to revoke Mike's SELECT privilege on Employees, but you don't want to revoke the rights that you gave to Joe, even temporarily. Can you do this in SQL?
11. Later you realize that Joe has been quite busy. He has defined a view called AllNames using the view EmployeeNames, defined another relation called StaffNames that he has access to (but that you can't access), and given his secretary Mike the right to read from the AllNames view. Mike has passed this right on to his friend Susan. You decide that even at the cost of annoying Joe by revoking some of his privileges, you simply have to take away Mike and Susan's rights to see your data. What REVOKE statement would you execute? What rights does Joe have on Employees after this statement is executed? What views are dropped as a consequence?

Answer 17.2 Answer omitted.
Exercise 17.3 Briefly answer the following questions.

1. Explain the intuition behind the two rules in the Bell-LaPadula model for mandatory access control.
2. Give an example of how covert channels can be used to defeat the Bell-LaPadula model.
3. Give an example of polyinstantiation.
4. Describe a scenario in which mandatory access controls prevent a breach of security that cannot be prevented through discretionary controls.
5. Describe a scenario in which discretionary access controls are required to enforce a security policy that cannot be enforced using only mandatory controls.
6. If a DBMS already supports discretionary and mandatory access controls, is there a need for encryption?
7. Explain the need for each of the following limits in a statistical database system:
(a) A maximum on the number of queries a user can pose.
(b) A minimum on the number of tuples involved in answering a query.
(c) A maximum on the intersection of two queries (i.e., on the number of tuples that both queries examine).

8. Explain the use of an audit trail, with special reference to a statistical database system.
9. What is the role of the DBA with respect to security?
10. What is public-key encryption? How does it differ from the encryption approach taken in the Data Encryption Standard (DES), and in what ways is it better than DES?
11. What are one-way functions, and what role do they play in public-key encryption?
12. Explain how a company offering services on the Internet could use public-key encryption to make its order-entry process secure. Describe how you would use DES encryption for the same purpose, and contrast the public-key and DES approaches.

Answer 17.3 Answers not available.
18 TRANSACTION MANAGEMENT OVERVIEW
Exercise 18.1 Give brief answers to the following questions:

1. What is a transaction? In what ways is it different from an ordinary program (in a language such as C)?
2. Define these terms: atomicity, consistency, isolation, durability, schedule, blind write, dirty read, unrepeatable read, serializable schedule, recoverable schedule, avoids-cascading-aborts schedule.
3. Describe Strict 2PL.

Answer 18.1

1. A transaction is an execution of a user program, and is seen by the DBMS as a series, or list, of actions. The actions that can be executed by a transaction include reads and writes of database objects.

2. Blind write: a transaction writes to an object without ever reading the object.

A transaction reads a database object that has been modified by another not-yet-committed transaction. Such a read is called a dirty read.

Suppose a transaction T2 changes the value of an object A that has been read by a transaction T1, while T1 is still in progress. If T1 now tries to read the value of A again, it will get a different result, even though it has not modified A in the meantime. Such a read is called an unrepeatable read.

A serializable schedule over a set S of transactions is a schedule whose effect on any consistent database instance is identical to that of some complete serial schedule over the set of committed transactions in S.

A schedule is conflict serializable if it is conflict equivalent to some serial schedule. (Two schedules are said to be conflict equivalent if they involve the same set of actions of the same transactions, and they order every pair of conflicting actions of two committed transactions in the same way.)
A recoverable schedule is one in which transactions commit only after all transactions whose changes they read commit. If transactions read only the changes of committed transactions, the schedule is not only recoverable; aborting a transaction can also be accomplished without cascading the abort to other transactions. Such a schedule is said to avoid cascading aborts.

A schedule is view serializable if it is view equivalent to some serial schedule. Two schedules S1 and S2 over the same set of transactions are view equivalent if:

If Ti reads the initial value of object A in S1, it must also read the initial value of A in S2.
If Ti reads a value of A written by Tj in S1, it must also read the value of A written by Tj in S2.
For each data object A, the transaction (if any) that performs the final write on A in S1 must also perform the final write on A in S2.
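Conflict serializability, as defined above, can be tested mechanically: build the precedence graph (an edge Ti -> Tj for each pair of conflicting actions where Ti's action comes first) and check it for cycles. A minimal sketch, with a schedule encoding invented for illustration:

```python
# Each action is a tuple (transaction_id, op, object), where op is "R" or "W".
def conflict_serializable(schedule):
    """Return True iff the schedule's precedence graph is acyclic."""
    edges = set()
    for i, (ti, op1, obj1) in enumerate(schedule):
        for tj, op2, obj2 in schedule[i + 1:]:
            # Conflicting actions: same object, different transactions,
            # and at least one of the two actions is a write.
            if ti != tj and obj1 == obj2 and "W" in (op1, op2):
                edges.add((ti, tj))

    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def has_cycle(node, stack, done):
        # Depth-first search; a node revisited while still on the
        # recursion stack closes a cycle.
        if node in stack:
            return True
        if node in done:
            return False
        stack.add(node)
        found = any(has_cycle(n, stack, done) for n in graph.get(node, ()))
        stack.discard(node)
        done.add(node)
        return found

    return not any(has_cycle(n, set(), set()) for n in graph)

# R1(A) W2(A) W1(A): edges T1 -> T2 and T2 -> T1, a cycle.
bad = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
# Serial execution of T1 then T2: all edges point T1 -> T2.
good = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
```

Conflict serializability implies view serializability but not vice versa; the precedence-graph test is the one used in practice because it is cheap, whereas testing view serializability is intractable in general.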
3. 2PL is a way for the DBMS to ensure that only serializable schedules are allowed. The rules are: each transaction must get an S-lock on an object before reading it; each transaction must get an X-lock on an object before writing it; once a transaction releases a lock, it cannot acquire any new locks. Strict 2PL adds one more rule to ensure that only 'safe' interleavings of transactions are allowed: all locks held by a transaction are released when the transaction is completed. Conservative 2PL is a variant of 2PL. Under this protocol, a transaction obtains all the locks that it will ever need when it begins, or blocks waiting for all these locks to become available. This scheme ensures that there will not be any deadlocks by removing one of the four necessary conditions for deadlock: hold while waiting.

4. The phantom problem is the situation in which a transaction retrieves a collection of objects twice and sees different results, even though it does not modify any of these objects itself and it follows the Strict 2PL protocol. The phantom problem is specific to dynamic databases, so it cannot occur in a database where the set of database objects is fixed and only the values of objects can be changed.

5. This is due to the multiprogramming and time-sharing nature of modern operating systems. If lock and unlock are not atomic operations, two or more transactions can hold conflicting locks on the same object! For example, when one transaction
requests an exclusive lock, the lock manager checks and finds that no other transaction holds a lock on the object, and therefore grants the request. If this thread is descheduled before it has a chance to mark the lock as taken, another transaction could start to run. This second transaction might have done the same thing and have been granted a conflicting lock. To prevent this, the entire sequence of lock and unlock must be atomic operations.

6. When timestamps are used for deadlock prevention, if a transaction is aborted and restarted, it is given the same timestamp that it had originally. On the other hand, when timestamps are used for concurrency control, if a transaction is aborted and restarted, it is given a new, larger timestamp.

7. To understand and justify Thomas' Write Rule fully, we need to give the complete context in which it arises. In a timestamp-based concurrency control scheme, the following rules are applied when transaction T wants to write object O:

(a) If TS(T) < RTS(O), the write action conflicts with the most recent read action of O, and T is therefore aborted and restarted.

(b) If TS(T) < WTS(O), a naive approach would be to abort T as well, because its write action conflicts with the most recent write of O and is out of timestamp order. But it turns out that we can safely ignore such a previous write and proceed with this new write; this is called Thomas' Write Rule.

(c) Otherwise, T writes O and WTS(O) is set to TS(T).

The justification is as follows: had TS(T) < RTS(O), T would have been aborted and we would not have bothered to check WTS(O). So to decide whether to abort T based on WTS(O), we can assume that TS(T) >= RTS(O). If TS(T) >= RTS(O) and TS(T) < WTS(O), then RTS(O) < WTS(O), which means the previous write of O occurred after the most recent read of O and was therefore never read by anyone; hence the previous write can be safely ignored.
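The check-then-grant race described above disappears if the lock table protects its own state with a mutex, making the test and the grant a single atomic step. A minimal sketch, assuming a simple X-lock-only lock table; the `LockTable` class and its fields are illustrative, not from the text.

```python
import threading

class LockTable:
    """Sketch: grant or deny X-locks. The internal mutex (self.latch) makes
    the check-and-mark step atomic, so two threads cannot both observe the
    object as unlocked and both be granted conflicting locks."""
    def __init__(self):
        self.latch = threading.Lock()   # protects the lock table itself
        self.holder = {}                # obj -> txn currently holding the X-lock

    def try_x_lock(self, txn, obj):
        with self.latch:                # atomic check-then-grant
            if obj in self.holder and self.holder[obj] != txn:
                return False            # conflicting lock held: deny
            self.holder[obj] = txn      # mark the lock as taken
            return True

    def unlock(self, txn, obj):
        with self.latch:
            if self.holder.get(obj) == txn:
                del self.holder[obj]

lt = LockTable()
print(lt.try_x_lock("T1", "A"))   # True: granted
print(lt.try_x_lock("T2", "A"))   # False: T1 holds a conflicting lock
```

Without the latch, a thread switch between the membership test and the assignment could let both requests succeed, which is exactly the failure mode described in the answer.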
Exercise 18.2 Consider the following actions taken by transaction T1 on database objects X and Y:

R(X), W(X), R(Y), W(Y)
1. Give an example of another transaction T2 that, if run concurrently to transaction T1 without some form of concurrency control, could interfere with T1.

2. Explain how the use of Strict 2PL would prevent interference between the two transactions.
3. Strict 2PL is used in many database systems. Give two reasons for its popularity.

Answer 18.2 Answer omitted.

Exercise 18.3 Consider a database with objects X and Y and assume that there are two transactions T1 and T2. Transaction T1 reads objects X and Y and then writes object X. Transaction T2 reads objects X and Y and then writes objects X and Y.

1. Give an example schedule with actions of transactions T1 and T2 on objects X and Y that results in a write-read conflict.

2. Give an example schedule with actions of transactions T1 and T2 on objects X and Y that results in a read-write conflict.

3. Give an example schedule with actions of transactions T1 and T2 on objects X and Y that results in a write-write conflict.

4. For each of the three schedules, show that Strict 2PL disallows the schedule.

Answer 18.3 Answer not available yet.

Exercise 18.4 Consider the following (incomplete) schedule S:

T1:R(X), T1:R(Y), T1:W(X), T2:R(Y), T3:W(Y), T1:W(X), T2:R(Y)

1. Can you determine the serializability graph for this schedule? Assuming that all three transactions eventually commit, show the serializability graph.

2. For each of the following, modify S to create a complete schedule that satisfies the stated condition. If a modification is not possible, explain briefly. If it is possible, use the smallest possible number of actions (read, write, commit, or abort). You are free to add new actions anywhere in the schedule S, including in the middle.

(a) Resulting schedule avoids cascading aborts but is not recoverable.

(b) Resulting schedule is recoverable.

(c) Resulting schedule is conflict-serializable.

Answer 18.4 Answer omitted.

Exercise 18.5 Suppose that a DBMS recognizes increment, which increments an integer-valued object by 1, and decrement as actions, in addition to reads and writes. A transaction that increments an object need not know the value of the object; increment and decrement are versions of blind writes.
In addition to shared and exclusive locks, two special locks are supported: An object must be locked in I mode before incrementing it and locked in D mode before decrementing it. An I lock is compatible with another I or D lock on the same object, but not with S and X locks.
1. Illustrate how the use of I and D locks can increase concurrency. (Show a schedule allowed by Strict 2PL that only uses S and X locks. Explain how the use of I and D locks can allow more actions to be interleaved, while continuing to follow Strict 2PL.)

2. Informally explain how Strict 2PL guarantees serializability even in the presence of I and D locks. (Identify which pairs of actions conflict, in the sense that their relative order can affect the result, and show that the use of S, X, I, and D locks according to Strict 2PL orders all conflicting pairs of actions to be the same as the order in some serial schedule.)

Answer 18.5 1. Take the following two transactions as an example:
T1: Increment A, Decrement B, Read C;
T2: Increment B, Decrement A, Read C

If only S and X locks are used under Strict 2PL, then since all these actions are versions of blind writes, the transactions have to obtain X-locks on the objects. Following Strict 2PL, T1 gets an X-lock on A; if T2 now gets an X-lock on B, there will be deadlock. Even if T1 is fast enough to have grabbed the X-lock on B first, T2 will be blocked until T1 finishes. This allows little concurrency. If I and D locks are used, since I and D are compatible, T1 obtains an I-lock on A and a D-lock on B, while T2 can still obtain an I-lock on B and a D-lock on A; both transactions can be interleaved, allowing maximum concurrency.

2. The pairs of actions that conflict are: RW, WW, WR, IR, IW, DR, DW. We know that Strict 2PL orders the first three conflicting pairs of actions to be the same as the order in some serial schedule. We can also show that, even in the presence of I and D locks, Strict 2PL orders the latter four pairs of actions to be the same as the order in some serial schedule. The easiest way to see this is to think of an I (or D) lock under these circumstances as an exclusive lock, since an I (or D) lock is not compatible with S and X locks anyway.
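The compatibility rules for S, X, I, and D locks can be written as a small matrix. This sketch simply encodes the rules stated in the exercise (I and D compatible with each other and with themselves, but with nothing else; S compatible only with S); the function and table names are illustrative.

```python
# Lock-compatibility matrix for S, X, I, D locks, per Exercise 18.5:
# I and D are compatible with I and D, but not with S or X; S is
# compatible only with S; X is compatible with nothing.
COMPATIBLE = {
    ("S", "S"): True,
    ("I", "I"): True, ("I", "D"): True,
    ("D", "I"): True, ("D", "D"): True,
}

def compatible(held, requested):
    # Any pair not listed above conflicts.
    return COMPATIBLE.get((held, requested), False)

# T1 holds an I-lock on A; T2's D-lock request on A is granted,
# but an S-lock or X-lock request on A would block.
print(compatible("I", "D"))  # True
print(compatible("I", "S"))  # False
print(compatible("X", "I"))  # False
```

This is what allows the interleaving in part 1: both transactions' I and D requests on A and B are mutually compatible, while the plain X-lock encoding serializes them.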
19
CONCURRENCY CONTROL
Exercise 19.1 1. Define these terms: conflict-serializable schedule, view-serializable schedule, strict schedule.

2. Describe each of the following locking protocols: 2PL, Conservative 2PL.

3. Why must lock and unlock be atomic operations?

4. What is the phantom problem? Can it occur in a database where the set of database objects is fixed and only the values of objects can be changed?

5. Identify one difference in the timestamps assigned to restarted transactions when timestamps are used for deadlock prevention versus when timestamps are used for concurrency control.

6. State and justify the Thomas Write Rule.

Answer 19.1 1. A transaction is an execution of a user program, and is seen by the DBMS as a series, or list, of actions. The actions that can be executed by a transaction include reads and writes of database objects.

2. Blind write: a transaction writes to an object without ever reading the object. A transaction reads a database object that has been modified by another not-yet-committed transaction; such a read is called a dirty read. Suppose a transaction T2 changes the value of an object A that has been read by a transaction T1, while T1 is still in progress. If T1 now tries to read the value of A again, it will get a different result, even though it has not modified A in the meantime; such a read is called an unrepeatable read. A serializable schedule over a set S of transactions is a schedule whose effect on any consistent database instance is identical to that of some complete serial schedule over the set of committed transactions in S.
A schedule is conflict serializable if it is conflict equivalent to some serial schedule. (Two schedules are said to be conflict equivalent if they involve the same set of actions of the same transactions, and they order every pair of conflicting actions of two committed transactions in the same way.) A recoverable schedule is one in which a transaction commits only after every transaction whose changes it read has committed. If transactions read only the changes of committed transactions, the schedule is not only recoverable; aborting a transaction can also be accomplished without cascading the abort to other transactions. Such a schedule is said to avoid cascading aborts. A schedule is view serializable if it is view equivalent to some serial schedule. Two schedules S1 and S2 over the same set of transactions are view equivalent if:

If Ti reads the initial value of object A in S1, it must also read the initial value of A in S2.

If Ti reads a value of A written by Tj in S1, it must also read the value of A written by Tj in S2.

For each data object A, the transaction (if any) that performs the final write on A in S1 must also perform the final write on A in S2.

3. 2PL is a way for the DBMS to ensure that only serializable schedules are allowed. The rules are: each transaction must get an S-lock on an object before reading it; each transaction must get an X-lock on an object before writing it; once a transaction releases a lock, it cannot acquire any new locks. Strict 2PL adds one more rule to ensure that only 'safe' interleavings of transactions are allowed: all locks held by a transaction are released when the transaction is completed. Conservative 2PL is a variant of 2PL. Under this protocol, a transaction obtains all the locks that it will ever need when it begins, or blocks waiting for all these locks to become available.
This scheme ensures that there will not be any deadlocks by removing one of the four necessary conditions for deadlock: hold while waiting.

4. The phantom problem is the situation in which a transaction retrieves a collection of objects twice and sees different results, even though it does not modify any of these objects itself and it follows the Strict 2PL protocol. The phantom problem is specific to dynamic databases, so it cannot occur in a database where the set of database objects is fixed and only the values of objects can be changed.
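The phantom scenario is easy to reproduce outside a DBMS: locking the individual records a query saw does nothing to stop records that do not exist yet. A toy sketch, with an illustrative employee list standing in for a table and a predicate query run twice:

```python
# Sketch of the phantom problem: T1's predicate query ("oldest employee")
# is answered twice; T2 inserts a qualifying row in between. Record-level
# locks on the rows T1 saw cannot cover the not-yet-inserted row.
employees = [{"name": "a", "age": 30}, {"name": "b", "age": 40}]

def oldest(db):
    return max(r["age"] for r in db)

first = oldest(employees)                    # T1's first query sees 40
employees.append({"name": "c", "age": 50})   # T2 inserts a phantom row
second = oldest(employees)                   # T1's repeat query sees 50
print(first, second)  # 40 50
```

Preventing this requires locking the predicate itself (e.g., via index locking), not just the records the first query returned.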
5. This is due to the multiprogramming and time-sharing nature of modern operating systems. If lock and unlock are not atomic operations, two or more transactions can hold conflicting locks on the same object! For example, when one transaction requests an exclusive lock, the lock manager checks and finds that no other transaction holds a lock on the object, and therefore grants the request. If this thread is descheduled before it has a chance to mark the lock as taken, another transaction could start to run. This second transaction might have done the same thing and have been granted a conflicting lock. To prevent this, the entire sequence of lock and unlock must be atomic operations.

6. When timestamps are used for deadlock prevention, if a transaction is aborted and restarted, it is given the same timestamp that it had originally. On the other hand, when timestamps are used for concurrency control, if a transaction is aborted and restarted, it is given a new, larger timestamp.

7. To understand and justify Thomas' Write Rule fully, we need to give the complete context in which it arises. In a timestamp-based concurrency control scheme, the following rules are applied when transaction T wants to write object O:

(a) If TS(T) < RTS(O), the write action conflicts with the most recent read action of O, and T is therefore aborted and restarted.

(b) If TS(T) < WTS(O), a naive approach would be to abort T as well, because its write action conflicts with the most recent write of O and is out of timestamp order. But it turns out that we can safely ignore such a previous write and proceed with this new write; this is called Thomas' Write Rule.

(c) Otherwise, T writes O and WTS(O) is set to TS(T).

The justification is as follows: had TS(T) < RTS(O), T would have been aborted and we would not have bothered to check WTS(O). So to decide whether to abort T based on WTS(O), we can assume that TS(T) >= RTS(O).
If TS(T) >= RTS(O) and TS(T) < WTS(O), then RTS(O) < WTS(O), which means the previous write of O occurred after the most recent read of O and was therefore never read by anyone; hence the previous write can be safely ignored.
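The write-side checks (a)-(c) above translate directly into code. A minimal sketch, assuming per-object RTS/WTS dictionaries; the read side, abort handling, and restart with a new timestamp are elided, and all names are illustrative.

```python
# Sketch of the timestamp-ordering write test with the Thomas Write Rule.
# rts[obj] and wts[obj] hold the read and write timestamps of each object.

class Restart(Exception):
    """Raised when the writer must be aborted and restarted."""

def write(ts, obj, rts, wts, db, value):
    if ts < rts.get(obj, 0):
        raise Restart            # (a) conflicts with a later read of obj
    if ts < wts.get(obj, 0):
        return                   # (b) Thomas Write Rule: skip obsolete write
    db[obj] = value              # (c) perform the write
    wts[obj] = ts

rts, wts, db = {}, {}, {}
write(2, "A", rts, wts, db, "v2")   # WTS(A) becomes 2
write(1, "A", rts, wts, db, "v1")   # ignored: TS 1 < WTS(A) = 2
print(db["A"])  # v2
```

Note that case (b) silently returns rather than aborting: the skipped write is exactly the "never read by anyone" write that the justification shows is harmless.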
Exercise 19.2 Consider the following classes of schedules: serializable, conflict-serializable, view-serializable, recoverable, avoids-cascading-aborts, and strict. For each of the following schedules, state which of the above classes it belongs to. If you cannot decide whether a schedule belongs in a certain class based on the listed actions, explain briefly.
The actions are listed in the order they are scheduled, and prefixed with the transaction name. If a commit or abort is not shown, the schedule is incomplete; assume that abort/commit must follow all the listed actions.

1. T1:R(X), T2:R(X), T1:W(X), T2:W(X)

2. T1:W(X), T2:R(Y), T1:R(Y), T2:R(X)

3. T1:R(X), T2:R(Y), T3:W(X), T2:R(X), T1:R(Y)

4. T1:R(X), T1:R(Y), T1:W(X), T2:R(Y), T3:W(Y), T1:W(X), T2:R(Y)

5. T1:R(X), T2:W(X), T1:W(X), T2:Abort, T1:Commit

6. T1:R(X), T2:W(X), T1:W(X), T2:Commit, T1:Commit

7. T1:W(X), T2:R(X), T1:W(X), T2:Abort, T1:Commit

8. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Commit

9. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Abort

10. T2:R(X), T3:W(X), T3:Commit, T1:W(Y), T1:Commit, T2:R(Y), T2:W(Z), T2:Commit

11. T1:R(X), T2:W(X), T2:Commit, T1:W(X), T1:Commit, T3:R(X), T3:Commit

12. T1:R(X), T2:W(X), T1:W(X), T3:R(X), T1:Commit, T2:Commit, T3:Commit

Answer 19.2 Answer omitted.
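Membership in the conflict-serializable class can be tested mechanically by building the precedence graph and checking it for a cycle. A minimal sketch using the `T1:R(X)` notation of these exercises; commits and aborts are ignored, and the parser and function names are illustrative.

```python
# Sketch: build the precedence graph of a schedule and test it for cycles
# (the graph is acyclic if and only if the schedule is conflict-serializable).

def precedence_edges(schedule_text):
    ops = []
    for tok in schedule_text.replace(" ", "").split(","):
        txn, rest = tok.split(":")          # e.g. "T1" and "R(X)"
        ops.append((txn, rest[0], rest[2:-1]))
    edges = set()
    for i, (t1, a1, o1) in enumerate(ops):
        for t2, a2, o2 in ops[i + 1:]:
            if t1 != t2 and o1 == o2 and "W" in (a1, a2):
                edges.add((t1, t2))         # conflicting pair; t1 acts first
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(src, dst, seen=()):
        return any(n == dst or (n not in seen and
                                reachable(n, dst, seen + (n,)))
                   for n in graph.get(src, ()))
    return any(reachable(u, u) for u in graph)

# Schedule 1 above: T1 and T2 conflict in both directions, so the graph
# has a cycle and the schedule is not conflict-serializable.
e = precedence_edges("T1:R(X), T2:R(X), T1:W(X), T2:W(X)")
print(has_cycle(e))  # True
```

The other properties (recoverability, ACA, strictness) additionally depend on where the commits and aborts fall, which is why the exercise asks about incomplete schedules separately.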
Exercise 19.3 Consider the following concurrency control protocols: 2PL, Strict 2PL, Conservative 2PL, Optimistic, Timestamp without the Thomas Write Rule, Timestamp with the Thomas Write Rule, and Multiversion. For each of the schedules in Exercise 19.2, state which of these protocols allows it, that is, allows the actions to occur in exactly the order shown. For the timestamp-based protocols, assume that the timestamp for transaction Ti is i and that a version of the protocol that ensures recoverability is used. Further, if the Thomas Write Rule is used, show the equivalent serial schedule.

Answer 19.3 See Table 19.1. Note the following abbreviations. S-2PL: Strict 2PL; C-2PL: Conservative 2PL; Opt CC: Optimistic; TS w/o TWR: Timestamp without the Thomas Write Rule; TS w/ TWR: Timestamp with the Thomas Write Rule.
      2PL   S-2PL   C-2PL   Opt CC   TS w/o TWR   TS w/ TWR   Multiversion
 1     N      N       N       N          N            N            N
 2     Y      N       N       Y          Y            Y            Y
 3     N      N       N       Y          N            N            Y
 4     N      N       N       Y          N            N            Y
 5     N      N       N       Y          N            Y            Y
 6     N      N       N       N          N            Y            Y
 7     N      N       N       Y          N            N            N
 8     N      N       N       N          N            N            N
 9     N      N       N       Y          N            N            N
10     N      N       N       N          Y            Y            Y
11     N      N       N       N          N            Y            N
12     N      N       N       N          N            Y            Y

Table 19.1
Thomas Write Rule is used in the following schedules, and the equivalent serial schedules are shown below:

5. T1:R(X), T1:W(X), T2:Abort, T1:Commit

6. T1:R(X), T1:W(X), T2:Commit, T1:Commit

11. T1:R(X), T2:Commit, T1:W(X), T1:Commit, T3:R(X), T3:Commit
Exercise 19.4 Consider the following sequences of actions, listed in the order they are submitted to the DBMS:

Sequence S1: T1:R(X), T2:W(X), T2:W(Y), T3:W(Y), T1:W(Y), T1:Commit, T2:Commit, T3:Commit

Sequence S2: T1:R(X), T2:W(Y), T2:W(X), T3:W(Y), T1:W(Y), T1:Commit, T2:Commit, T3:Commit

For each sequence and for each of the following concurrency control mechanisms, describe how the concurrency control mechanism handles the sequence. Assume that the timestamp of transaction Ti is i. For lock-based concurrency control mechanisms, add lock and unlock requests to the above sequence of actions as per the locking protocol. The DBMS processes actions in the order shown. If a transaction is blocked, assume that all of its actions are queued until it is resumed; the DBMS continues with the next action (according to the listed sequence) of an unblocked transaction.
      Serializable   Conflict-serializable   Recoverable   Avoids cascading aborts
 1        No                  No                  No                  No
 2        No                  No                  Yes                 Yes
 3        Yes                 Yes                 Yes                 Yes
 4        Yes                 Yes                 Yes                 Yes

Table 19.2
1. Strict 2PL with timestamps used for deadlock prevention.

2. Strict 2PL with deadlock detection. (Show the waits-for graph if a deadlock cycle develops.)

3. Conservative (and strict, i.e., with locks held until end-of-transaction) 2PL.

4. Optimistic concurrency control.

5. Timestamp concurrency control with buffering of reads and writes (to ensure recoverability) and the Thomas Write Rule.

6. Multiversion concurrency control.

Answer 19.4 Answer omitted.
Exercise 19.5 For each of the following locking protocols, assuming that every transaction follows that locking protocol, state which of these desirable properties are ensured: serializability, conflict-serializability, recoverability, avoids cascading aborts.

1. Always obtain an exclusive lock before writing; hold exclusive locks until end-of-transaction. No shared locks are ever obtained.

2. In addition to (1), obtain a shared lock before reading; shared locks can be released at any time.

3. As in (2), and in addition, locking is two-phase.

4. As in (2), and in addition, all locks are held until end-of-transaction.

Answer 19.5 See Table 19.2.
Exercise 19.6 The Venn diagram (from [76]) in Figure 19.1 shows the inclusions between several classes of schedules. Give one example schedule for each of the regions S1 through S12 in the diagram.
[Figure 19.1 shows nested regions: Serial inside Strict, inside Avoid Cascading Abort, inside Recoverable; these intersect Conflict Serializable, inside View Serializable, inside All Schedules. The regions are labeled S1 through S12.]

Figure 19.1    Venn Diagram for Classes of Schedules
Answer 19.6 Answer omitted.
Exercise 19.7 Briefly answer the following questions:

1. Draw a Venn diagram that shows the inclusions between the classes of schedules permitted by the following concurrency control protocols: 2PL, Strict 2PL, Conservative 2PL, Optimistic, Timestamp without the Thomas Write Rule, Timestamp with the Thomas Write Rule, and Multiversion.

2. Give one example schedule for each region in the diagram.

3. Extend the Venn diagram to include the class of serializable and conflict-serializable schedules.

Answer 19.7 1. See Figure 19.2.
2. (a) Here we define the following schedules first:

i. C1: T0:R(O), T0:Commit.
ii. C2: T1:Begin, T2:Begin, T1:W(A), T1:Commit, T2:R(A), T2:Commit.
iii. C3: T4:Begin, T3:Begin, T3:W(B), T3:Commit, T4:W(B), T4:Abort.
iv. C4: T4:Begin, T3:Begin, T3:W(B), T3:Commit, T4:R(B), T4:Abort.
v. C5: T3:Begin, T4:Begin, T4:R(B), T4:Commit, T3:W(B), T3:Commit.
vi. C6: T5:Begin, T6:Begin, T6:R(D), T5:R(C), T5:Commit, T6:W(C), T6:Commit.
vii. C7: T5:Begin, T6:Begin, T6:R(D), T5:R(C), T6:W(C), T5:Commit, T6:Commit.
viii. C8: T5:Begin, T6:Begin, T5:R(C), T6:W(C), T5:R(D), T5:Commit, T6:Commit.
[Figure 19.2 shows nested regions: 2PL inside Strict 2PL, inside Conservative 2PL, inside Optimistic; these are crossed with the regions for Timestamp without TWR, Timestamp with TWR, and Multiversion. The regions are labeled S1 through S32.]

Figure 19.2
Then we have the following schedule for each region in the diagram. (Note: S1: C2,C5,C8 means that S1 is the combination of schedules C2, C5, and C8.)

i. S1: C2,C5,C8
ii. S2: C2,C4,C8
iii. S3: C2,C3,C8
iv. S4: C2,C8
v. S5: C2,C5,C7
vi. S6: C2,C4,C7
vii. S7: C2,C3,C7
viii. S8: C2,C7
ix. S9: C2,C5,C6
x. S10: C2,C4,C6
xi. S11: C2,C3,C6
xii. S12: C2,C6
xiii. S13: C2,C5
xiv. S14: C2,C4
xv. S15: C2,C3
xvi. S16: C2,C1
For the remaining 16 regions, the schedule is obtained by removing C2 from the corresponding schedule above (e.g., S17: C5,C8, obtained by removing C2 from S1).

3. See Figure 19.3.
[Figure 19.3 extends Figure 19.2 with two additional regions: the Conflict-serializable class and, containing it, the Serializable class.]

Figure 19.3
Exercise 19.8 Answer each of the following questions briefly. The questions are based on the following relational schema:

Emp(eid: integer, ename: string, age: integer, salary: real, did: integer)
Dept(did: integer, dname: string, floor: integer)

and on the following update command:

replace (salary = 1.1 * EMP.salary) where EMP.ename = 'Santa'

1. Give an example of a query that would conflict with this command (in a concurrency control sense) if both were run at the same time. Explain what could go wrong, and how locking tuples would solve the problem.

2. Give an example of a query or a command that would conflict with this command, such that the conflict could not be resolved by just locking individual tuples or pages, but requires index locking.

3. Explain what index locking is and how it resolves the preceding conflict.

Answer 19.8 Answer omitted.

Exercise 19.9 SQL-92 supports four isolation-levels and two access-modes, for a total of eight combinations of isolation-level and access-mode. Each combination implicitly defines a class of transactions; the following questions refer to these eight classes.
1. For each of the eight classes, describe a locking protocol that allows only transactions in this class. Does the locking protocol for a given class make any assumptions about the locking protocols used for other classes? Explain briefly.

2. Consider a schedule generated by the execution of several SQL transactions. Is it guaranteed to be conflict-serializable? To be serializable? To be recoverable?

3. Consider a schedule generated by the execution of several SQL transactions, each of which has READ ONLY access-mode. Is it guaranteed to be conflict-serializable? To be serializable? To be recoverable?

4. Consider a schedule generated by the execution of several SQL transactions, each of which has SERIALIZABLE isolation-level. Is it guaranteed to be conflict-serializable? To be serializable? To be recoverable?

5. Can you think of a timestamp-based concurrency control scheme that can support the eight classes of SQL transactions?

Answer 19.9 1. The protocols for the SERIALIZABLE, REPEATABLE READ and READ COMMITTED classes rely on the assumption that transactions in other classes obtain exclusive locks before writing objects and hold exclusive locks until the end of the transaction.

(a) SERIALIZABLE + RO: Strict 2PL, including locks on any set of objects that it requires to be unchanged. No exclusive locks are granted.

(b) SERIALIZABLE + RW: Strict 2PL, including locks on any set of objects that it requires to be unchanged.

(c) REPEATABLE READ + RO: Strict 2PL, locking only individual objects, not sets of objects. No exclusive locks are granted.

(d) REPEATABLE READ + RW: Strict 2PL, locking only individual objects, not sets of objects.

(e) READ COMMITTED + RO: Obtains shared locks before reading objects, but these locks are released immediately.

(f) READ COMMITTED + RW: Obtains exclusive locks before writing objects, and holds these locks until the end. Obtains shared locks before reading objects, but these locks are released immediately.

(g) READ UNCOMMITTED + RO: Does not obtain shared locks before reading objects.
(h) READ UNCOMMITTED + RW: Obtains exclusive locks before writing objects, and holds these locks until the end. Does not obtain shared locks before reading objects.

2. If we make no requirements on the access-mode and isolation-level of the transactions, the schedule is not guaranteed to be conflict-serializable, serializable, or recoverable.
3. A schedule generated by the execution of several SQL transactions, each of which has READ ONLY access-mode, is guaranteed to be conflict-serializable, serializable, and recoverable.

4. A schedule generated by the execution of several SQL transactions, each of which has SERIALIZABLE isolation-level, is guaranteed to be conflict-serializable, serializable, and recoverable.

5. To be finished.
Exercise 19.10 Consider the tree shown in Figure 19.5. Describe the steps involved in executing each of the following operations according to the tree-index concurrency control algorithm discussed in Section 19.3.2, in terms of the order in which nodes are locked, unlocked, read, and written. Be specific about the kind of lock obtained, and answer each part independently of the others, always starting with the tree shown in Figure 19.5.

1. Search for data entry 40*.

2. Search for all data entries k* with k ≤ 40.

3. Insert data entry 62*.

4. Insert data entry 40*.

5. Insert data entries 62* and 75*.

Answer 19.10 Answer omitted.
Exercise 19.11 Consider a database that is organized in terms of the following hierarchy of objects: The database itself is an object (D), and it contains two files (F1 and F2), each of which contains 1000 pages (P1 ... P1000 and P1001 ... P2000, respectively). Each page contains 100 records, and records are identified as p:i, where p is the page identifier and i is the slot of the record on that page. Multiple-granularity locking is used, with S, X, IS, IX and SIX locks, and database-level, file-level, page-level and record-level locking. For each of the following operations, indicate the sequence of lock requests that must be generated by a transaction that wants to carry out (just) these operations:

1. Read record P1200:5.

2. Read records P1200:98 through P1205:2.

3. Read all (records on all) pages in file F1.
4. Read pages P500 through P520.

5. Read pages P10 through P980.
6. Read all pages in F1 and modify about 10 pages, which can be identified only after reading F1.

7. Delete record P1200:98. (This is a blind write.)

8. Delete the first record from each page. (Again, these are blind writes.)

9. Delete all records.

Answer 19.11 1. IS on D; IS on F2; IS on P1200; S on P1200:5.
2. IS on D; IS on F2; IS on P1200, S on P1201 through P1204, IS on P1205; S on P1200:98/99/100, S on P1205:1/2.

3. IS on D; S on F1.

4. IS on D; IS on F1; S on P500 through P520.

5. IS on D; S on F1.

6. IS and IX on D; SIX on F1.

7. IX on D; IX on F2; X on P1200. (Locking the whole page is not necessary, but it simplifies any subsequent reorganization or compaction of the page.)

8. IX on D; X on F1 and F2. (There are many ways to do this; there is a trade-off between overhead and concurrency.)

9. IX on D; X on F1 and F2.
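The pattern running through these answers, intention locks on every ancestor from the root down, then the actual S or X lock at the chosen granularity, can be generated mechanically. A minimal sketch; the path representation and function name are illustrative, and SIX handling is omitted.

```python
# Sketch: lock-request sequence for multiple-granularity locking. To take
# an S or X lock on a node, first take IS or IX (respectively) on every
# ancestor of that node, from the database root downward.

def lock_requests(path, mode):
    """path: nodes from the database root down to the object being locked;
    mode: 'S' or 'X'. Returns the ordered list of (lock, node) requests."""
    intent = "IS" if mode == "S" else "IX"
    return [(intent, node) for node in path[:-1]] + [(mode, path[-1])]

# Reading record P1200:5 (answer 19.11, part 1): IS on D, IS on F2,
# IS on P1200, then S on the record itself.
print(lock_requests(["D", "F2", "P1200", "P1200:5"], "S"))
# [('IS', 'D'), ('IS', 'F2'), ('IS', 'P1200'), ('S', 'P1200:5')]
```

Choosing where `path` stops is the tuning knob the answers exercise: stopping at F1 (as in part 3) trades per-page locks for one coarse lock, at the cost of concurrency.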
20
CRASH RECOVERY
Exercise 20.1 Briefly answer the following questions:

1. How does the recovery manager ensure atomicity of transactions? How does it ensure durability?

2. What is the difference between stable storage and disk?

3. What is the difference between a system crash and a media failure?

4. Explain the WAL protocol.

5. Describe the steal and no-force policies.

Answer 20.1 1. The Recovery Manager ensures atomicity of transactions by undoing the actions of transactions that do not commit. It ensures durability by making sure that all actions of committed transactions survive system crashes and media fail...