Lecture 1: Operating System Principles

Historical development of operating systems: Different decades brought different generations of operating systems.

• In the 40s, there were no operating systems.
• In the 50s, batch operating systems were developed to smooth the transition between jobs.
• In the 60s, multistream batch systems and timesharing systems were developed.
  • Interactive systems.
  • Turnaround time vs. response time.
  • Virtual storage.
• In the 70s, timesharing came of age.
  • Controlled sharing and protection: Multics.
  • Object-oriented systems.
  • The emergence of UNIX.
• In the 80s, the major influences were the personal computer and parallel processing.
  • Computer networks.
  • Graphical interfaces.
  • Concurrency and distribution of responsibility.
• The 90s have seen networking come to the fore.
  • Fileservers.
  • Client-server model.
  • Programs loaded across the Web.

Structure of operating systems: An OS manages a variety of hardware and software resources.

• Processors, memories, I/O devices.
• Programs and data files.

Beautification principle: "an operating system is a set of algorithms that hide the details of the hardware and provide a more pleasant environment." (Finkel)

How might an operating system be organized?

Monolithic structure: Many modules, e.g., processor scheduling, memory management, device management, but only one of these can execute at a time.

• A single process.
• Control transferred by procedure calls and branches.

User programs may be viewed as subroutines that are invoked by the OS.

How does the OS regain control?

• Program terminates (like a "return").
• Program runs out of time.
• Program requests system service, e.g., I/O.
• An interrupt occurs, indicating that, e.g., a device requires attention of the OS.

[Figure: A monolithic operating system. Labels: user programs p1 ... pn; procedure call, SVC, send/receive, return; call, branch; operating system; interrupt; HW.]

As they grow large, monolithic systems become unreliable.
© 1998 Edward F. Gehringer, CSC/ECE 501 Lecture Notes

The kernel approach: The most vital low-level functions are collected in a "kernel," or "nucleus." Other parts of the operating system call kernel functions to perform these low-level tasks.

The kernel is protected against other parts of the system.

There may be more than one process at a time in a kernel-based OS. (A process can be thought of as a program in execution.)

• The kernel is responsible for dividing CPU time among all the processes.
• Some processes are "system processes" and others are "user processes."
• There must be some way for the processes to interact.

[Figure: A kernel-based operating system. Labels: user processes ui; system processes si; communication; kernel call, return; kernel; interrupt; HW.]

Processes can communicate via send and receive primitives.

• Each primitive specifies a buffer that can hold the message.
• It also specifies where the message should be taken from or put.
  Why do we need both of these?
• Also, the receiver (or sender) must be specified.

E.g., "Receive a message from YourProcess, taking the message out of OurBuffer and putting it into variable Msg."

Process hierarchy: In a kernel-based OS, everything is either in the kernel or not in the kernel.

If the OS is large, the kernel may grow fairly large itself, raising the same problems as before.

Therefore, the operating system may be structured in more than two levels.

An early example of this was Dijkstra's THE operating system.

  Level                              Contents
  L4: Independent user processes     User 1, User 2, ..., User n
  L3: Virtual I/O devices            I/O device processes
  L2: Virtual operator consoles      Command interpreter
  L1: Virtual segmented memory       Segment controller
  L0: Virtual CPUs                   CPU allocation, synchronization
  Actual hardware                    CPU; main memory, secondary storage; operator's console; I/O devices

Each layer defines a successively more abstract "virtual machine."

Each layer performs a resource-management task.

Processes further up the hierarchy can use resources handled at lower levels.
• Processes above level 0 need not be concerned with the number of available physical processors.
• All processes above level 2 have the illusion of having their own "console."

Functional hierarchy: It is difficult to structure an OS strictly in terms of a functional hierarchy.

Consider process management and memory management.

• To create a new process, the process manager calls the memory manager to allocate space for the new process.
• When memory is full, the memory manager calls the process manager to deactivate a process, thereby freeing memory.

It is not always possible to organize processes into a strict hierarchy. So, place functions, not processes, into a hierarchy.

• Each level consists of a set of functions.
• The levels L0, L1, ..., Ln are ordered such that functions defined at level Li are also known to Li+1 and, at the discretion of Li+1, also to levels Li+2, etc.
• L0 corresponds to the hardware instructions of the machine.
• Each level provides a new "computer" (a virtual machine) to the next higher level.

Example: Memory and process management.

[Figure: levels, from highest to lowest:
  Process creation/destruction
  Segment creation/destruction
  Process scheduling
  Segment management]

• The functions of process creation and destruction need segment creation and destruction, and thus are placed above the latter.
• Similarly, process scheduling is placed above segment management.
• Why are segment creation and destruction above process scheduling?

Object-oriented structure: A process can be considered to be an object. But a process is only one kind of object. Other kinds of objects include files, procedures, and pages of memory.

Basic idea: An object-oriented system can be characterized as a collection of abstract entities with certain basic properties and precisely defined interactions.
Each part of the system only needs to know the properties of an object, and the operations that may be performed on it, in order to be able to use it properly.

When an OS is structured according to the object model, each function (whether inside or outside the kernel) has access to only the data that it needs to manipulate.

Definition: An operating system (computer) is said to implement the object model if each action performed by the operating system (computer) is the application of some operation to one or more objects.

This is analogous to execution of ordinary instructions, where each instruction operates on a set of operands.

The object model is supported by various

• Languages, such as Smalltalk, C++ and Java.
• Operating systems, such as CAP, Hydra, STAROS, Monads, iMAX432.
• Distributed systems frameworks, such as CORBA, OLE, and DSOM.

Examples of objects:

(a) From real life: an elevator.
  • Properties: weight, volume, location (floor).
  • Operations: call, go up, go down, open door, close door.

(b) From application programs: a stack.
  • Properties: size (number of items); contents (the items themselves).
  • Operations: push, pop, inspect top element, initialize.

(c) From operating systems: a file system (Peterson & Silberschatz, §3.2).
  • Properties:
    • Directory structure.
    • Information about existing files (name, type, location, size, current position, protection, usage count, time, etc.).
  • Operations:
    • Create file.
      • Find space for the new file.
      • Enter the file in a directory.
    • Write file.
      • Search directory to find file.
      • From end-of-file pointer, determine location of next block.
      • Write information to next block.
      • Update end-of-file pointer.
    • Read file (similar to write).
    • Delete file.
      • Search directory to find file.
      • Release space occupied by file.
      • Invalidate directory entry.

Processes.

Definition: A process is an executing program.
More precisely: A process is an entity representing a program in execution.

Operations:

• Create process.
• Delete process.
• Suspend process (remove it from the processor; temporarily stop it).
• Resume process.
• etc.

Outline for Lecture 2
  I. Processes
     A. Process states
     B. Representation of a process
     C. Suspending a process
  II. Process scheduling
     A. Schedulers
     B. Multiplexers
  III. Threads
     A. User-level threads
     B. Kernel-level threads

Processes and process states: A process is an executing program (an entity representing a program in execution).

• A process is a sequence of actions and is therefore dynamic.
• A program is a sequence of instructions and is therefore static.

A process may be in one of three states:

• Running. Currently assigned to the (a) processor, which is executing it. Such a process is called the current process.
• Ready. The process could run, if (all) the processor(s) were not busy running other processes. A ready process is found on the ready queue.
  [Figure: an example ready queue holding P7, P12, P5.]
• Blocked. The process cannot run until something happens. For example, it may be awaiting the completion of input, or communication from another process. Such a process will be queued on a particular device or event.
  [Figure: example device queues for LPR and Disk; extracted labels: P2 P3 P9 LPR P2 P10 Disk.]

Transition diagram:

  Ready   -> Running   Assign
  Running -> Ready     Preempt
  Running -> Blocked   Block
  Blocked -> Ready     Awaken

Representation of a process (S&G, §4.1.3): According to the object model, each object has a representation and a set of operations defined on it.

A process is represented by a Process Control Block (PCB).

The contents of the PCB vary from one operating system to another. In general, though, the following set of items is representative:

[Figure: PCB layout. Process #; Proc. status; Program ctr.; Register save area; Memory-management information; Accounting information; I/O status information; Scheduling information.]

• The process number (also called process name, PID). Uniquely identifies this process.
• The process status (running, ready, or blocked).
• Program counter. Address of the next instruction that this process will execute.
• Register save area. A copy of the information in the CPU registers (programmable and non-programmable) when the process was last blocked or preempted.
• Memory-management information. E.g., base/length registers, page table.
• Accounting information. Amount of time used, time limit, user ID, etc.
• I/O status information. Outstanding I/O requests, open files, devices allocated, if any.
• Scheduling information. Pointer to ready queue or device queue, priority, etc.

Suspending a process (S&G, §4.1.1): Sometimes it is desirable to suspend a process, i.e., not to consider it for execution for a while. Usually this is because

• the system is too heavily loaded, and there is not enough room for all processes in main memory, or
• the system is unstable and may crash soon. If the process is running when that happens, it may be difficult or impossible to continue execution after the system comes back up.

A process in any state may be suspended.

[Figure: transition diagram with suspension. States: Ready, Running, Blocked, Suspended Ready, Suspended Blocked. Transition labels: Assign, Preempt, Awaken, Suspend (three arrows), Resume (two arrows), Event occurrence.]

Here is a more complete list of operations on processes:

• Create Process
• Destroy Process
• Suspend Process
• Resume Process
• Change Process's Priority
• Block Process
• Awaken Process
• Assign Process

Process scheduling (S&G, §4.2): Why schedule the processor? (Why not just let each process run until it's finished?) Well, a typical process runs as follows:

  Execute
  Wait for I/O to complete
  Execute
  :
  :

Consider two processes A and B that each

  Execute (1 second)
  Wait (1 second)
  :
  :
  60 times

If the processor runs A to completion and then B, it takes

• 4 minutes to complete both processes,
• 2 minutes (total) of processor time used (2 minutes wasted),
• 50% processor utilization.
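The arithmetic for this two-process example can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, switching overhead is assumed to be zero, and each process is modeled as 60 complete execute/wait cycles (so the final I/O wait is counted). It also models the alternative of switching to another process whenever the running one starts an I/O wait.

```python
def run_to_completion(n_procs, cycles, cpu, io):
    """Run each process to completion before starting the next.

    Returns (elapsed_seconds, processor_utilization).
    """
    elapsed = n_procs * cycles * (cpu + io)   # the processor idles during every wait
    busy = n_procs * cycles * cpu
    return elapsed, busy / elapsed


def interleaved(n_procs, cycles, cpu, io):
    """When the running process starts an I/O wait, switch to the process
    that has been ready longest. Switching overhead is assumed to be zero."""
    ready_at = [0] * n_procs        # time at which each process can next use the CPU
    remaining = [cycles] * n_procs  # execute/wait cycles left for each process
    clock = busy = 0
    while any(remaining):
        p = min((i for i in range(n_procs) if remaining[i]),
                key=lambda i: ready_at[i])
        clock = max(clock, ready_at[p]) + cpu   # idle until p is ready, then run its burst
        busy += cpu
        remaining[p] -= 1
        ready_at[p] = clock + io                # p now waits for its I/O
    elapsed = max(ready_at)                     # the last I/O wait must also finish
    return elapsed, busy / elapsed


print(run_to_completion(2, 60, 1, 1))
print(interleaved(2, 60, 1, 1))
```

Under these assumptions, running A and then B takes 240 seconds at 50% utilization, while switching on every I/O wait takes 121 seconds at roughly 99% utilization; the shortfall from 100% is only the final wait that the sketch chooses to count.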
If we interleave A and B, it takes

• about 2 minutes to finish each of A and B (the two finish 1 second apart),
• essentially 100% processor utilization.

While one process waits for I/O, another process can be executed.

How is the processor switched between processes?

  1. Process A executes.
  2. Save state (of A).
  3. Handle interrupt. Decide which process to run next.
  4. Restore state (of B).
  5. Process B executes.
  6. Save state (of B).
  7. Handle interrupt. Decide which process to run next.
  8. Restore state (of A).
  9. Process A executes.

When is the processor switched between processes? Examples:

• When the executing process performs a system call.
• When a higher-priority event needs attention.
• When a higher-priority process needs to be run.
• When a timer interrupt occurs.

Scheduling queues (S&G, §4.2.1): There are several queues within an operating system. Most of these should be familiar to you from earlier courses.

• Ready queue: ______
• Device queue: ______
• Job queue: ______

Queues are often maintained using what data structure?

Schedulers (S&G, §4.2.2): Process scheduling is performed by programs known in various OSs as the scheduler or the multiplexer.

Purpose: To allocate the processor among the processes.

Often there are two such programs.

• One (let us call it the multiplexer) is responsible for short-term scheduling (preemption, activation, awakening) of processes in main memory.
• The other (let us call it the scheduler) is responsible for long-term scheduling (changing priorities, suspension, resumption). Sometimes it is called the job scheduler.
[Figure: long-term and short-term scheduling. Labels: processes awaiting initiation; suspended processes awaiting resumption; ready processes; running processes; completed processes; initiation, suspension, resumption (long-term scheduling); assignment, preemption (short-term scheduling); completion.]

Often (in a non-virtual-memory operating system),

• the scheduler determines which processes should be brought into main memory, while
• the multiplexer determines which process in main memory should be executed next.

The scheduler is invoked infrequently (seconds, minutes). It may be slow.

The multiplexer is invoked frequently (e.g., at least once every 100 ms). It must be fast.

The scheduler controls the degree of multiprogramming.

If the degree of multiprogramming is stable, the scheduler may have to be invoked only when ______

The following diagram shows the progress of a job that is never suspended-ready and only blocks awaiting I/O.

[Figure: Long-term -> Ready queue -> Short-term -> Processor; Device queue; Perform I/O.]

The multiplexer is entered

• whenever ______
• whenever ______

We will assume that the multiplexer is entered after all ______

Operation of the multiplexer:

1. Is the current process still the most suitable to run? If so, return control to it. If not, then
2. Save the volatile state (PC and other processor registers) in the register save area in the PCB.
3. Retrieve the volatile environment of the "most suitable" process from its PCB.
4. Transfer control to that process (at the location indicated by the restored PC).

Steps 2-4 are known as a process switch.

Which process is "most suitable" depends on the scheduling discipline, which we will soon consider.

Operations on processes (S&G, §4.3): The OS must provide a way to create and terminate processes.

Process creation: The creator process is known as the parent, and the created process is called the child.
Regarding the resources of the child process, there are several options:

• Child obtains resources directly from OS.
• Child obtains resources from ______
• Child ______ resources with parent.

The parent may also pass resources, such as open files, to its children.

Regarding concurrency of the child, there are two options.

• ______
• ______

Regarding the address space of the child, there are also two options.

• ______
• ______

In Unix, each process is identified by a pid.

• A new process is created by the fork system call.
• It consists of a copy of the parent's address space. (Parent and child can thus communicate easily.)
• Both processes continue execution after the fork.
  • In the child, the return code for the fork is 0.
  • In the parent, the return code for the fork is ______
• After creation, a process's address space is normally replaced by an execve system call. It loads an object file into memory, on top of itself, and starts its execution.
• The parent can create more children, or suspend itself, via a wait system call.

By contrast,

• VMS creates a new process, and loads a specified program into its address space.
• Windows NT supports both models.

Process termination: When a process wants to terminate, it issues an exit call.

It may return data to its parent process.

All its resources are deallocated by the OS.

Generally, a parent is allowed to abort its child. Why might it do this?

• ______

In Unix, a process issues exit to terminate. A parent can wait for this event. wait returns the pid of the exiting child.

If the parent terminates, all children are terminated too.

Threads (S&G, §4.5): Traditionally, a process has consisted of

• a single address space, and
• a single thread of control.

Each process has its own PCB, and its own memory.

This makes it expensive to create and maintain processes.
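As a concrete reference point for the Unix calls described above (fork, exit, wait), here is a minimal sketch using Python's os module, which wraps the same system calls. It runs only on Unix-like systems; the helper name spawn_child and the exit code used are illustrative assumptions, and a real child would typically go on to call one of the exec functions rather than exit immediately.

```python
import os


def spawn_child(exit_code):
    """Fork a child that exits at once; return what the parent observes."""
    pid = os.fork()
    if pid == 0:
        # Child: fork returned 0. Terminate immediately with the given code,
        # skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: fork returned the child's pid. Block until that child exits.
    waited_pid, status = os.waitpid(pid, 0)
    return pid, waited_pid, os.WEXITSTATUS(status)


child_pid, waited_pid, code = spawn_child(7)
print(child_pid == waited_pid, code)
```

The pid reported by the wait call matches the value fork returned in the parent, and the small integer passed to exit is the data the child returns to its parent.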
With the advent of multiprocessors, it may be too expensive to create all the process data structures each time it is desired to execute code in parallel.

Multiprocessors come in two varieties:

• Shared-memory multiprocessors can have more than one processor sharing a region of memory. Processes can therefore communicate via shared memory.
• Distributed-memory multiprocessors usually have to perform interprocess communication by passing messages from one processor to another.

On a shared-memory multiprocessor, it is necessary to create and switch between tasks rapidly to take advantage of the available parallelism.

For example, a file server may service requests for files from many different users.

The clearest way to program a file system may well be to devote a separate "process" to each active request.

But all of these processes would have the same code, and much of the same data. Does it make sense to duplicate it?

For this reason, "lightweight processes," or threads, were developed.

A thread consists of

• a program counter,
• a register set,
• a stack of activation records, and (not mentioned in the text)
• ______

We can consider each "heavyweight" process as made up of one or more threads.

S&G use "task" to mean a heavyweight process, and "thread" to mean a lightweight process.

What is faster about using threads than using tasks?

Like processes, a thread can

• be scheduled onto a processor,
• block while waiting for system calls, and
• create child threads.

However, unlike processes, threads are not protected from other threads.

One thread can read and write the other's memory, even its stack.

Why doesn't this compromise security?

Consider the file server again.

• If a file server had a single thread, what would happen when it made an I/O request on behalf of a single user?
• If the task had multiple threads, while one thread was ______, the other could ______.
Without threads, a system designer could create parallelism through multiple (heavyweight) processes. But, ______

There is one important decision regarding the level at which threads are implemented. They are implemented either at the ______

User-level threads: A run-time library package provides the routines necessary for thread-management operations.

The libraries multiplex a potentially large number of user-defined threads on top of ______

What are some advantages of this approach?

• Low-cost context switching. Why?
• Existing kernel need not be modified.
• Flexibility. Different libraries can be developed for different applications. One application does not incur overhead from operations it does not need. E.g., one library could provide preemptive priority scheduling, while another might use FIFO.

What are some limitations of user-level threads?

• A kernel call by any thread causes ______
• Scheduling can be unfair. The threads in a process with many threads don't get as much CPU attention as those in a process with few threads.

Kernel-level threads: Kernel-level threads have their own advantages.

• It is easy to coordinate the synchronization and scheduling of threads, since the kernel has information concerning the status of all threads.
• They are suitable for applications such as server processes, where interactions with the kernel are frequent.

However,

• Thread-management operations are more expensive than with user-level threads.
• Since the kernel implements threads, it has to provide any feature needed by any reasonable application. This may mean greater overhead, even for applications that do not use all the features.

Example: Solaris 2. Has a library for user-level threads. Kernel "knows nothing" of these threads.

Solaris 2 implements

• user-level threads,
• lightweight processes, and
• kernel-level threads.

Each task contains at least one LWP.
User-level threads are multiplexed onto LWPs, and can only do work when they are being run on an LWP.

All operations within the kernel are executed by kernel-level threads.

• Each LWP has a kernel-level thread.
• Some threads (e.g., a thread to service disk requests) are not associated with an LWP.

Some kernel-level threads (e.g., device drivers) are tied to a specific processor, and some are not. A system call can "pin" any other thread to a specific processor, too.

Kernel intervention is not required to switch between user-level threads.

User-level threads can execute concurrently until they need kernel services. Then they must obtain an LWP, and will be blocked if they cannot.

However, a task need not be blocked waiting for I/O to complete. If it has multiple LWPs, others can continue while one is blocked.

Even if all the task's LWPs are blocked, user-level threads can continue to run.

Outline for Lecture 3
  I. Interprocess communication
     A. Send & receive
     B. Buffer capacities
     C. Synchronization
     D. Exceptional conds.
     E. The Mach msg. system
  II. Scheduling disciplines
     A. Criteria for scheduling disciplines
     B. FCFS
     C. LCFS
     D. SJN
     E. SRT
     F. RR
     G. Priority disciplines
     H. Multilevel priority queues
     I. Real-time scheduling

Interprocess communication (S&G §4.6): There are two ways in which processes may communicate.

• Shared memory.
• Message passing.

Message passing is a higher-level interprocess-communication mechanism because the operating system is responsible for enforcing synchronization.

Two basic operations:

• Send a message to a process.
• Receive a message from a process.

There are several "degrees of freedom" in designing message systems:

• How are links established?
• Are links unidirectional or bidirectional?
• Can a link be accessible to more than two processes?
• How much (if any) buffer space is associated with a link?
• Are messages fixed size or variable size?

One option is for processes to send messages "directly" to each other. For example, in the producer/consumer problem, the processes might communicate like this:

  producer:
    while ... do begin
      produce an item in nextp;
      send(consumer, nextp);
    end;

  consumer:
    while ... do begin
      receive(producer, nextc);
      consume the item in nextc;
    end;

S&G calls this "direct communication." What are some shortcomings of this scheme?

• ______
• ______

In most systems, message operations instead name message buffers, usually known as mailboxes or ports.

  send(A, message)      send a message to mailbox A.
  receive(A, message)   receive a message from mailbox A.

S&G calls this "indirect communication."

A process that creates a mailbox (e.g., via a call to the memory manager) is the owner of the mailbox. Generally, it is the owner's responsibility to allow other processes to use the mailbox.

However, the operating system may also create mailboxes; these mailboxes can outlive individual processes.

In some systems, only two processes can share a mailbox: one sender and one receiver.

Some message systems allow more than two processes to share a mailbox.

• Regardless of how many processes share a mailbox, only one may update it (by inserting or removing a message) at a time.
• Usually messages are buffered in FIFO order. A process that does a receive retrieves the oldest message in the mailbox.

Similarly, two processes may share more than one mailbox, if they wish.

Mailbox capacities (S&G §4.6.3): Messages which have been sent but not yet received are buffered in the mailbox.

The capacity determines the number of messages that may be buffered at a time.

• Unbounded capacity. Mailbox size is potentially "infinite"; it can grow to occupy all of primary and secondary storage, for example. Thus, the mailbox is never "full"; it is always possible for the sender to place another message in it.
• Bounded capacity. A mailbox has a finite size; it can get full.
  If a send is performed to a full mailbox, special action is necessary.
• Zero capacity. The mailbox can't hold any messages. Special action (described below) is always necessary. This kind of interaction is called a rendezvous.

In the zero-capacity case, a sending process knows that the receiver got the message.

In the other cases, a sending process can't be sure whether its message has been received.

Thus, if the sending process P wants to be sure, the receiving process Q needs to send an acknowledgment message.

Thus, process P executes

  send(mbox, message);
  receive(ackbox, reply);

Process Q executes

  receive(mbox, message);
  send(ackbox, reply);

Why are two mailboxes needed?

In the Thoth operating system, there was a reply primitive as well as send and receive operations.

• A process that did a send was delayed until the reply was received.
• The reply overwrote the message that had been sent.
• The only difference between send and reply was that the latter did not block the sender.

This is actually a special case of remote procedure call (RPC). In a remote procedure call,

• the caller sends a message to the callee, passing the parameters in the message body;
• the receiver performs some procedure, and then sends a message back, returning the result (if there is one) in the message.

The caller waits for the RPC to be completed. This is just what happens in an ordinary procedure call, where the caller does not continue until the callee is finished.

Synchronization: Special action is necessary

• when a process attempts to receive from an empty mailbox, or
• when a process attempts to send to a full mailbox.

Some systems allow up to three kinds of special actions:

• the receiving (sending) process is blocked until a message arrives (is removed).
• a special error code is returned that means "try again later."
• the receiving (sending) process is placed on a queue to be notified when a message (space) becomes available, although it is not blocked in the meantime.

Example: Assume the "special action" is to block a process. Initially the mailbox is empty.

  Action           Result
  Ernie sends      Now one message in mailbox.
  John receives    The message sent by Ernie.
  Joe receives     No message there; Joe is blocked.
  Mike sends       Joe receives message and is unblocked.

Exception conditions (S&G §4.6.3): Process terminated. Suppose that a process P sends or receives a message from a process Q that has terminated. What happens ...

• if P tries to receive a message from a process that has terminated?
  ______
• if P sends a message to a process that has terminated? It depends on whether the mailbox is ______
  If so, ______
  If not, ______

Here too, the system should either terminate P or notify P that Q has terminated.

Lost messages. If it is important to detect lost messages, either the OS or the sending process may take on this responsibility.

One way of detecting lost messages is via timeouts: Require a reply message, and if it is not received "soon enough," assume that it is lost.

What problem could this cause if the message is not really lost?

Message passing by value vs. message passing by reference: Similar to parameter passing in Pascal.

• pass by value: copy the entire message (which may be very long).
• pass by reference: pass a pointer to the message.

If pass-by-value is used,

• the mailbox has to be bigger, and probably has to contain a length indication for each message;
• a lot of work may be wasted if the recipient only refers to a few bytes of the message.

If pass-by-reference is used,

• ______

The Mach message system (S&G §4.6.4): Mach is a distributed extension of Unix developed at Carnegie Mellon University.

In Mach, all inter-task communication is done via messages, which are sent to ports. System calls are also performed through messages.

Two ports are created for each task.
• The kernel mailbox is used by the task to communicate with the kernel.
• The notify mailbox is where the kernel sends notification of event occurrences.

There are three message operations: msg_send, msg_receive, and msg_rpc.

A message has a fixed-length header and a variable-length body.

The body is a list of typed data items. Different types of data items that may be sent include

• access rights,
• task states, and
• memory segments.

Sending a "segment of memory" is a way to avoid copying a message twice:

• ______
• ______

This double copying is a major reason why many message systems are slow.

If a mailbox is full, the sending thread has four options.

• Wait indefinitely until there is room in the mailbox.
• Wait at most n milliseconds.
• Do not wait at all, but return immediately.
• Temporarily cache a message "under a rock." The "rock" can hide only one message.

The last option allows a time-critical process to continue even if the target port is full.

This allows, e.g., a line-printer driver to inform a task that its request has been performed, even if the task's port is full.

Why is it important to do this?

Scheduling disciplines: (Different disciplines may be used by the low-level and the high-level schedulers.)

Criteria for scheduling disciplines: Scheduling disciplines can be evaluated according to how well they meet the following goals.

1. Maximize processor utilization. The processor should be kept as busy as possible.
2. Maximize throughput. Throughput is the number of processes that complete per time unit.
3. Minimize turnaround time. Turnaround time = time between a request for process creation and process termination. A good scheduling strategy attempts to minimize average turnaround time (over a set of processes).
4. Minimize response time. Response time = time between a request for process creation and when the first response is produced.
   This is important in ______

Scheduling disciplines may be either preemptive or non-preemptive.

• With a non-preemptive scheduling discipline, once a processor is given to a process (sometimes called a "job"), the job continues to use the processor until it finishes using the processor.
• With a preemptive scheduling discipline, a process may have the processor taken away from it by a "more deserving" process.

Let us now consider several specific scheduling disciplines.

FCFS scheduling (first-come, first-served; also known as FIFO).

• The simplest strategy to implement. Occasionally the most appropriate.
• Jobs are scheduled according to time of arrival.
• All jobs run to completion (no preemption).
• This policy requires small jobs to wait for large jobs, but gives waiting times with a low variance.

Example:

  Process   Processing time
  P1        24
  P2        3
  P3        3

Suppose the processes arrive in the order P1, P2, P3. Here is a Gantt chart of the execution of these processes.

  |            P1            | P2 | P3 |
  0                          24   27   30

  Turnaround times:  P1 ______  P2 ______  P3 ______
  Average turnaround time: ______

LCFS scheduling (last-come, first-served).

• Every arrival changes the choice.
• Short jobs get very good service, but long jobs are postponed indefinitely.

How would LCFS work on our example from above (assuming all 3 jobs arrived by time 0)?

  | P3 | P2 |            P1            |
  0    3    6                          30

  Turnaround times:  P1 ______  P2 ______  P3 ______
  Average turnaround time: ______

Why did this schedule perform so much better than the previous one?

This experience suggests the following discipline:

SJN scheduling (shortest job next).

• Non-preemptive.
• Chooses the waiting process with the smallest estimated "runtime-to-completion."
• Reduces average waiting time over FCFS, but waiting times have larger variance.
• Favors short jobs at the expense of long jobs.
• Less useful in a time-sharing environment.
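The FCFS and LCFS examples above can be worked through mechanically. The sketch below is a minimal illustration, assuming (as in the LCFS example) that all three jobs are present at time 0, so that a job's turnaround time equals its completion time; the function names are hypothetical.

```python
def turnaround_times(order, processing_time):
    """Non-preemptive schedule: jobs run back to back in the given order.

    All jobs are assumed to arrive at time 0, so turnaround = completion time.
    """
    clock = 0
    result = {}
    for name in order:
        clock += processing_time[name]
        result[name] = clock
    return result


def average(times):
    return sum(times.values()) / len(times)


jobs = {"P1": 24, "P2": 3, "P3": 3}

fcfs = turnaround_times(["P1", "P2", "P3"], jobs)        # order of arrival
lcfs = turnaround_times(["P3", "P2", "P1"], jobs)        # reverse order of arrival
sjn = turnaround_times(sorted(jobs, key=jobs.get), jobs)  # shortest job first

print(fcfs, average(fcfs))
print(lcfs, average(lcfs))
print(sjn, average(sjn))
```

For this job mix, FCFS gives turnaround times of 24, 27, and 30 (average 27), while LCFS gives 30, 6, and 3 for P1, P2, and P3 (average 13). SJN reaches the same average as LCFS here because the order of arrival happens to put the long job first.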
SRT scheduling (shortest remaining time).
- Preemptive counterpart of SJN.
- A running process may be preempted by a new process with a shorter estimated time.
- SRT minimizes average waiting times over a set of jobs.

RR scheduling (round-robin).
- The most common low-level scheduling discipline.
- Effective for time-sharing.
- Must choose a quantum, or time slice. The process is preempted when the quantum expires, and placed at the end of the ready queue.

[Figure: entering jobs join the ready queue, which feeds the processor; a job leaves the processor on completion, or returns to the queue on preemption.]

Note: A quantum is not a fixed time slot. Each job gets a full quantum, regardless of whether the previous job used up its quantum.

The processor is effectively shared among the jobs. A short job is not forced to wait for completion of longer jobs ahead of it.

Choice of quantum size:
- Short quanta: ______
- Long quanta: ______

"Rules of thumb":
- Quantum = 100 × (preemption overhead).
- 80% of CPU bursts should be shorter than the quantum. (What is a CPU burst?)

Usually, a quantum q is between 10 and 100 msec.

Notice that if there are N processes in the ready queue, then each process gets 1/N of the processor time in chunks of at most q time units at once. No process waits more than (N - 1)q time units before its next chunk.

PS scheduling (processor-sharing). Cannot be implemented, but a useful way of modeling RR. PS is defined as

    lim (Quantum → 0) (RR with zero overhead)

(What is lim (Quantum → ∞) RR?)

This is a reasonable approximation of RR when the quantum is
- large with respect to overhead, and
- small with respect to mean CPU burst.

Special definition of waiting time for PS discipline: Waiting time = Completion time

Priority disciplines are scheduling disciplines where each process is assigned a priority number (e.g., an integer). The processor is allocated to the process with the highest priority. SJN is a priority discipline where the priority is the (predicted) processing time.
Problem: Starvation: low-priority processes may never execute.
Solution: Aging: increase the priority as time progresses.

Example:

    Process   Arrival time   Processing time   Priority
    P1        0              6                 3
    P2        3              16                5
    P3        4              8                 7

(High priority number = high priority.) Let's draw a Gantt chart:

    0    4    8    12   16   20   24   28

Multilevel priority queues: Several queues; highest is RR, and all the rest are FCFS.

[Figure: level-1, level-2, ..., level-n queues, each feeding the server; at each level a job leaves either by completion or by preemption.]

- A job arrives initially at the level-1 queue.
- Each time a job is selected for processing, it is taken from the lowest non-empty queue.
- A job may move between the queues.
- The quantum usually increases as the job moves to higher levels.

Example:

    Queue   Time quantum
    1       8 msec.
    2       16 msec.
    3       RR, quantum 24 msec.

- A new arrival enters queue 1, which is served FCFS.
- When it arrives at the processor, it gets 8 msec.
- If it does not finish in 8 msec., it is moved to queue 2.
- Again it is served FCFS and gets 16 more msec.
- If it does not finish, it moves to queue 3, where it is served RR.

Favors very short jobs even more than RR. Sometimes disciplines other than FCFS and RR may be used at different levels.

To use this discipline, several parameters need to be defined:
- The number of queues.
- The scheduling discipline within each queue.
- When do jobs move up to another queue?
- When (if ever) do they move down?
- Which queue is entered by a newly arriving job?

Multiprocessor scheduling: Scheduling is more complex when multiple processors are available.
- When can a particular process be executed by any processor in the system? When ______
- The possibility of load-sharing or load-balancing arises.
- The processors can use a common ready queue.
- The OS must be programmed so that only one processor at a time updates the queue.

Two kinds of control.
- Asymmetric multiprocessing.
- Symmetric multiprocessing.

Real-time scheduling (S&G, 5.5): Real-time systems are divided into two types.
- Hard real-time systems are required to complete a critical task within a guaranteed amount of time.
- Soft real-time systems require that critical processes receive priority over other processes.

Hard real-time systems may employ deadline scheduling. This guarantees different job classes (e.g., real-time processes, interactive users, batch jobs) a certain fraction of processing time.
- Each job class is guaranteed a certain fraction fi of processor time within each duty cycle.
- Each job in the class may be allocated some portion of this fi.
- If Σ fi ≤ 1, then each process is guaranteed to receive its quota of processing time in every cycle of ready time.

What characteristics of a system might make it impossible to honor these guarantees?

Soft real-time systems
- have priority scheduling, with
- real-time processes having the highest priority.

The dispatch latency must be small. How can the dispatch latency be lowered? The problem is that many OSs have to wait for a system call to complete or an I/O block to occur before doing a context switch. Therefore, system calls must be made preemptible by
- inserting preemption points in long-duration system calls. At each preemption point, the system checks whether a high-priority process needs to execute. Sometimes it is difficult to find enough places to add preemption points.
- making the entire kernel preemptible. Then all kernel data structures must be protected by synchronization mechanisms.

What happens if the higher-priority process needs to access data structures that are currently being accessed by a lower-priority process? This is called priority inversion. In fact, there could be a chain of processes preventing access to the data structure. This problem can be attacked with the priority-inheritance protocol. What do you think it does?
Let us list the components of the response interval of a real-time process.

Other considerations in scheduling: A job which uses much more processor time than I/O is said to be CPU bound. A job which does much more I/O than processing is called I/O bound. A scheduler should try to activate some CPU-bound and some I/O-bound jobs at the same time.

Outline for Lecture 4
I. Process synchronization
   A. Precedence relations
   B. Mutually noninterfering systems
   C. Maximally parallel systems
II. Specification of concurrency
   A. Fork & join
   B. parbegin ... parend
   C. Concurrent tasks
   D. Race conditions

Process synchronization: A computer system may have concurrent processes because of
- multiprogramming (multitasking): one process may be started before another one finishes, and
- multiprocessing: there may be more than one (central) processor and I/O devices.

Until now, we have assumed that a program in execution is a single process. In almost any program, some activities can be carried out concurrently or in arbitrary order. Thus we may have concurrent processes within a single program.

Consider this short program:

    read(a);
    b := sin(2.0);
    c := a + b;
    write(c);

It doesn't matter which order the first two statements are executed in. They could even be executed concurrently. A precedence graph shows the order in which activities can execute.

[Figure: precedence graph with edges read(a) → c := a + b, b := sin(2.0) → c := a + b, and c := a + b → write(c).]

A precedence graph is a directed acyclic graph whose nodes correspond to individual statements or program fragments. We must unfold loops to create a precedence graph. In a precedence graph, an edge from Pi to Pj means that Pi must execute before Pj.

Here is a more complicated precedence graph:

[Figure: precedence graph on nodes P1 through P7.]

P2 and P3 can be executed after P1 completes. P4 can be executed after P2 completes. P7 can only execute after P5, P6, and P3 complete.
P3 can be executed concurrently with P2, P4, P5, and P6.

Precedence relations: A precedence graph illustrates a precedence relation. Let P = {P1, P2, ..., Pn} be a set of cooperating processes. The execution order of P is defined by a relation →, a set of ordered pairs from P × P:

    → = { (Pi, Pj) | Pi must complete before Pj can start }

Pi → Pj means (Pi, Pj) ∈ →. "Pi is a direct predecessor of Pj."

If there exists a sequence Pi → Pj → ... → Pk, we say "Pi is a predecessor of Pk" and "Pk is a successor of Pi." Two processes are independent if neither is a predecessor of the other.

A redundant edge in a precedence graph is an edge (Pi, Pj) such that there is a longer path between Pi and Pj. Precedence relations may include redundant edges; precedence graphs should not.

What is a possible precedence relation for the precedence graph on nodes P1 through P7 shown earlier?

    P = {P1, P2, P3, P4, P5, P6, P7}
    → = ______

Mutually noninterfering systems: A system is called determinate if the sequence of values written in each memory cell is the same for any order of execution allowed by P. How can we decide if a system of processes is determinate? Let us define
- Read set: the set of variables read from memory. More precisely, the read set R(Pi) is the set of variables referenced during the execution of process (statement) Pi that were not written by Pi.
- Write set: the set of variables written to memory. The write set W(Pi) is the set of variables referenced during the execution of process Pi that were written by Pi.

For the program at the start of the section, what is

    R(read(a))       ______
    R(c := a + b)    ______
    W(read(a))       ______
    W(c := a + b)    ______

Let Pi and Pj be two processes (statements). Pi and Pj can be executed concurrently if and only if

    R(Pi) ∩ W(Pj) = ∅
    W(Pi) ∩ R(Pj) = ∅
    W(Pi) ∩ W(Pj) = ∅

These are called Bernstein's conditions. Two processes are mutually noninterfering
- if ______
- if ______

A mutually noninterfering system of processes is determinate.
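Bernstein's conditions are three set intersections, so they are easy to check mechanically. The sketch below represents each statement as a (read set, write set) pair of Python sets. The particular sets listed for the four statements of the short program are one reading of the definitions above, and they assume that the input and output devices are not counted as memory variables.

```python
def bernstein_ok(stmt_i, stmt_j):
    """True if the two statements satisfy all three of Bernstein's conditions.

    Each statement is a (read_set, write_set) pair of Python sets.
    """
    r_i, w_i = stmt_i
    r_j, w_j = stmt_j
    return not (r_i & w_j) and not (w_i & r_j) and not (w_i & w_j)


# (read set, write set) for each statement; an assumed reading of the
# definitions, with I/O devices treated as outside memory.
read_a = (set(), {"a"})            # read(a)
assign_b = (set(), {"b"})          # b := sin(2.0)
assign_c = ({"a", "b"}, {"c"})     # c := a + b
write_c = ({"c"}, set())           # write(c)

print(bernstein_ok(read_a, assign_b))   # the first two statements
print(bernstein_ok(read_a, assign_c))   # read(a) writes what c := a + b reads
print(bernstein_ok(assign_c, write_c))  # c := a + b writes what write(c) reads
```

Note that the check is pairwise: read(a) and write(c) share no variables, yet they are still ordered in the program because both are related to c := a + b.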
Maximally parallel systems: It is important for systems to be determinate. Why? But it is not good to over-specify precedence constraints either. Why?

A system is called maximally parallel if the removal of any pair (Pi, Pj) from → makes Pi and Pj interfering.

Given a system of processes, how can we construct a maximally parallel system? The maximally parallel system equivalent* to a determinate system P is formed by replacing the original precedence relation → with a new precedence relation →′ defined by

    (Pi, Pj) is in →′ iff
    - Pi is a predecessor of Pj in →, and
    - Pi and Pj do not satisfy Bernstein's conditions.

*What does "equivalent" mean?

Specification of concurrency:

fork and join constructs: The earliest (and lowest-level) language construct for specifying concurrency.

Syntax: fork L;
- Starts an independent process at the label L. The computation that executed the fork, however, continues execution at the instruction that follows the fork.
- Splits an instruction stream into two streams that can be executed simultaneously by independent processors.
- Like a goto that is taken by one processor and not taken by the other.

Syntax: join n;
- Inverse of fork; brings independent instruction streams together. Causes n independent streams to merge.
- The execution of instructions following a join n instruction will not take place until n independent processes have executed the join n instruction.
- A process blocks until enough others have executed the join. Then all but the last process terminate.

The execution of join n has the following effect:

    n := n - 1;
    if n ≠ 0 then quit;

The place where a join occurs is known as a synchronization point.
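The "decrement, then quit unless zero" effect of join n can be sketched with Python threads. This is only an illustration of the counter semantics: the class name JoinCounter and the function run_example are hypothetical, a threading.Lock stands in for the indivisibility of join, and the short program read(a); b := sin(2.0); c := a + b from the start of the section is used as the workload (with the input value passed as an argument instead of being read).

```python
import math
import threading


class JoinCounter:
    """Models `join n`: each arriving stream decrements the counter."""

    def __init__(self, n):
        self._n = n
        self._lock = threading.Lock()   # makes the decrement-and-test indivisible

    def join(self):
        """Return True for the last stream to arrive (it continues);
        return False for every other stream (it quits)."""
        with self._lock:
            self._n -= 1
            return self._n == 0


def run_example(a_input):
    state = {}
    count = JoinCounter(2)

    def after_join():
        # Code following the join: executed only by the last stream to arrive.
        state["c"] = state["a"] + state["b"]

    def stream_a():
        state["a"] = a_input            # stands in for read(a)
        if count.join():
            after_join()

    def stream_b():
        state["b"] = math.sin(2.0)      # b := sin(2.0)
        if count.join():
            after_join()

    forked = threading.Thread(target=stream_b)   # plays the role of `fork L1`
    forked.start()
    stream_a()
    forked.join()   # Python-level wait so the result can be returned safely
    return state["c"]


print(run_example(1.0))
```

Whichever stream reaches the join last computes c, so the result does not depend on the relative speeds of the two streams.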
Here are precedence graphs for fork and join:

[Figure: for fork, P1 is followed by P2 and P3; for join, P1, P2, ..., Pn are followed by Pn+1.]

Let's use fork and join to code our examples:

            count := 2;
            fork L1;
            read(a);
            goto L2;
    L1:     b := sin(2.0);
    L2:     join count;
            c := a + b;
            write(c);

            P1;
            count := 3;
            fork L1;
            P2;
            P4;
            fork L2;
            ______
            goto L3;
    L2:     ______
            goto L3;
    L1:     ______
    L3:     join count;
            ______

The fork and join constructs are inconvenient to use because they force the programmer to specify control flow at a very low, goto-like level.

The parbegin ... parend construct:

    parbegin P1; P2; ...; Pn parend;

All statements enclosed between parbegin and parend can be executed concurrently. This statement implements the precedence graph

[Figure: P0 is followed by P1, P2, ..., Pn, which are all followed by Pn+1.]

We can now redo our examples using this statement. Note that parbegin ... parends may be nested.

    parbegin
        read(a);
        b := sin(2.0);
    parend;
    c := a + b;
    write(c);

    parbegin
        P3;
        parbegin
        parend;
    parend;

There are some precedence graphs that the parbegin ... parend construct cannot implement. Can you cite an example?

On the other hand, any parbegin P1; P2; ...; Pn parend construct can be simulated using fork/join:

            count := n;
            fork L2;
            fork L3;
              :
            fork Ln;
            P1; goto LL;
    L2:     P2; goto LL;
    L3:     P3; goto LL;
              :
    Ln:     Pn;
    LL:     join count;

Concurrent tasks (6.1 of S&G): A task is a sequence of actions performed in sequential order. There are several advantages to defining concurrent tasks in a program:
- Concurrent execution on multiple processors.
- Theoretical performance improvement on a single processor.
- Simpler solutions.

How do concurrent tasks communicate and synchronize with each other? There are two basic ways:
- Via shared memory: p1 and p2 both access shared variables.
- Via message passing: p1 sends and p2 receives.

Race conditions and their avoidance: It is rarely possible to decompose programs into tasks which share no information.
By Bernstein's conditions, we must strictly control the order in which shared variables are read and updated.

Example:

    j := 10;
    parbegin
        write(j);
        j := 1000;
    parend;

A race condition occurs when the outcome of a computation depends on the relative speeds of its tasks. One of the simplest examples of a race condition occurs when two tasks are attempting to update the same variable:

    p1: count := count + 1;         p2: count := count - 1;

    (1a) R1 := count;               (2a) R2 := count;
    (1b) R1 := R1 + 1;              (2b) R2 := R2 - 1;
    (1c) count := R1;               (2c) count := R2;

Which order of execution of these sequences will give an incorrect result?

Outline for Lecture 5
I. Critical sections
   A. Constraints on soln.
   B. Peterson's algorithm
   C. Bakery algorithm
II. Hardware assistance with synchronization

To prevent accidents like this, we must assure that only one process at a time accesses shared variables. This is called the mutual exclusion problem.

A region of code that can be executed by only one process at a time is called a critical section. In general, we need to implement

    while ... do
    begin
        ... ;   { Non-critical section }
        mutexbegin;
        ... ;   { Critical section }
        mutexend;
        ... ;   { Non-critical section }
    end;

mutexbegin is called the entry code and mutexend the exit code.

To provide a general solution, our code must satisfy three constraints:

1. Mutual exclusion: If one process is executing a critical section, then no other process can be executing that critical section.
2. No mutual blocking: When a process is not in its critical section, it may not prevent other processes from entering their critical sections. When a decision must be made to admit one of the competing processes to its critical section, the selection cannot be postponed indefinitely.
3. Fairness: If process pi wants to enter its critical section, other processes must not be allowed to enter their critical section arbitrarily often before pi is allowed to enter.
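The lost-update race between count := count + 1 and count := count - 1 discussed earlier in this section can be replayed deterministically by enumerating every interleaving of the six register-level steps (1a)-(1c) and (2a)-(2c). The sketch below does that; the function names and the starting value 5 are arbitrary choices made for illustration.

```python
from itertools import combinations


def run(schedule, count):
    """Execute one interleaving.

    `schedule` is a sequence of six task numbers (three 1s and three 2s);
    each entry lets that task execute its next step: load, add/subtract, store.
    """
    register = {1: 0, 2: 0}      # R1 and R2
    next_step = {1: 0, 2: 0}
    delta = {1: +1, 2: -1}       # p1 increments, p2 decrements
    for task in schedule:
        step = next_step[task]
        if step == 0:
            register[task] = count           # (a) R := count
        elif step == 1:
            register[task] += delta[task]    # (b) R := R +/- 1
        else:
            count = register[task]           # (c) count := R
        next_step[task] += 1
    return count


def all_outcomes(initial):
    """Map each possible final value of count to the schedules producing it."""
    outcomes = {}
    for slots in combinations(range(6), 3):   # positions taken by task 1
        schedule = tuple(1 if k in slots else 2 for k in range(6))
        outcomes.setdefault(run(schedule, initial), []).append(schedule)
    return outcomes


outcomes = all_outcomes(5)
for value in sorted(outcomes):
    print(value, len(outcomes[value]))
```

Only the two schedules in which one task runs all three of its steps before the other starts leave count unchanged; every other interleaving loses one of the updates.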
How do we implement mutexbegin and mutexend?

mutexbegin: Check whether there is any other process in its critical section. If so, the entering process must wait. When no other process is in a critical section, the process proceeds past mutexbegin, setting an indicator so that other processes reaching a mutexbegin will wait.

mutexend: Must allow a waiting process (if there is one) to enter its critical section.

Seems easy! Just use a flag called occupied (initialized to false) to indicate whether a process is in a critical section. The code for each process looks like this:

    while ... do
    begin
        ... ;   { Non-critical section }
        while occupied do begin end;
        occupied := true;
        ... ;   { Critical section }
        occupied := false;
        ... ;   { Non-critical section }
    end;

What's wrong with this solution?

A correct solution is surprisingly difficult to achieve. (See the development in S&G, pp. 159-162.) The correct solution uses
- A shared variable turn that keeps track of which process's turn it is to enter the critical section.
- A shared array flag that keeps track of which process is in the critical section:

    var flag: array [0 .. 1] of boolean;

Initially, elements of flag are false and turn is either 0 or 1.

Peterson's algorithm:

    while ... do
    begin
        ... ;   { Non-critical section }
        flag[i] := true;
        j := (i + 1) mod 2;
        turn := j;
        while flag[j] and turn = j do begin end;
        ... ;   { Critical section }
        flag[i] := false;
        ... ;   { Non-critical section }
    end;

We want to prove three things about this algorithm:
(a) Mutual exclusion.
(b) No mutual blocking.
(c) Fairness.

Proof:

(a) Mutual exclusion is preserved: If it were not, and both processes ended up in their critical section, we would have flag[0] = flag[1] = true. Both processes could not have passed their inner while loops at the same time, because turn would have been favorable to only one of the processes.
This implies that after one process (say p0) entered, turn was later set to a favorable value for the other process (p1). But this can't happen. Why? Hence p1 cannot enter its critical section until p0 leaves, so mutual exclusion is preserved.

(b) Mutual blocking cannot occur: Consider p1. It has only one wait loop (the inner while loop). Assume it can be forced to wait there forever. After a finite amount of time, p0 will be doing one of three general things:
   (i) not trying to enter,
   (ii) waiting at its inner while loop, or
   (iii) repeatedly cycling through its critical and non-critical sections.
In case (i), p1 can enter because flag[0] is false. Case (ii) is impossible, because turn must be either 0 or 1, so both processes can't be waiting at once. In case (iii), p0 will set turn to 1 and never change it back to 0, allowing p1 to enter.

(c) Fairness. After it indicates a desire to enter the critical section by setting flag[i] to true, neither process can be forced to wait for more than one critical-section execution by the other before it enters. Assume that p0 is in its critical section and p1 wishes to enter. Before p0 can enter a second time, it sets turn to 1, which prevents it from entering before p1 enters and completes its critical section.

More than two processes: the bakery algorithm: Based on each process "taking a number" when it wants to enter its critical section. The process with the lowest number is "served" next. However, two processes may receive the same number. In this case, the tie is broken in favor of the process with the lowest pid.

The common data structures are

    type process = 0 .. n-1;
    var choosing: array [process] of boolean;
        number: array [process] of integer;

Define (a, b) < (c, d) if a < c, or if a = c and b < d.

Here is the code:

    while ... do
    begin
        choosing[i] := true;
        number[i] := max(number[0], number[1], ..., number[n-1]) + 1;
        choosing[i] := false;
        for j := 0 to n-1 do
        begin
            while choosing[j] do begin end;
            while number[j] ≠ 0 and (number[j], j) < (number[i], i) do begin end;
        end;
        ... ;   { Critical section }
        number[i] := 0;
        ... ;   { Non-critical section }
    end;

To show that the algorithm is correct, we need to show that if a process arrives while another process Pi is in its critical section, the new process gets a number (actually, a (number, pid) pair) that is higher than process Pi's (number, pid) pair. Once we have shown that, the proof follows easily.

Let Pk be the new process that arrives while Pi is in its critical section. When it executes the second while loop for j = i, it finds number[i] ≠ 0 and (number[i], i) < (number[k], k). Thus, it continues looping until Pi leaves its critical section. Since the processes enter their critical section FCFS, progress and bounded waiting are guaranteed.

Hardware assistance with synchronization: Purely software synchronization is
- somewhat complicated
- expensive
Yet hardware assistance can be provided relatively easily.

Some instruction sets include a swap instruction, which provides the following indivisible action:

    procedure swap(var a, b: boolean);
    var temp: boolean;
    begin
        temp := a;
        a := b;
        b := temp;
    end;

Here is a solution to the critical-section problem using swap: A global boolean variable lock is initialized to false. key is a local variable in each process.
    while ... do
    begin
        ... ;   { Non-critical section }
        key := true;
        repeat swap(lock, key)
        until not key;
        ... ;   { Critical section }
        lock := false;
        ... ;   { Non-critical section }
    end;

Many other instruction sets include a test_and_set instruction, which performs the following indivisible action:

    function test_and_set(var target: boolean): boolean;
    begin
        test_and_set := target;
        target := true;
    end;

In other words, it returns the previous value of target, and regardless of what that value was, leaves target true.

With test_and_set, it's trivial to implement mutexbegin and mutexend. Let's rewrite our sample critical-section program.

    while ... do
    begin
        ... ;   { Non-critical section }
        ______
        ... ;   { Critical section }
        lock := false;
        ... ;   { Non-critical section }
    end;

This solution allows starvation. Why? A complete solution is given below.

To provide a starvation-free critical-section implementation using test_and_set, we declare these global variables:

    var waiting: array [0 .. n-1] of boolean;
        lock: boolean;

These variables are initialized to false. Each process has the following local variables:

    var j: 0 .. n-1;
        key: boolean;

Here is the code:

    while ... do
    begin
        ... ;   { Non-critical section }
        waiting[i] := true;
        key := true;
        while waiting[i] and key do
            key := test_and_set(lock);
        waiting[i] := false;
        ... ;   { Critical section }
        j := (i + 1) mod n;
        while (j ≠ i and not waiting[j]) do
            j := (j + 1) mod n;
        if j = i then lock := false
        else waiting[j] := false;
        ... ;   { Non-critical section }
    end;

A proof that this solution satisfies the three requirements is given in Peterson & Silberschatz (pp. 161-162).

In general, processes are allowed inside the critical section in cyclic order (0, 1, ..., n-1, 0, ...) as long as any process wants to enter. If no process wants to enter, lock is set to false, allowing the first process that decides to enter to do so.
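The cyclic hand-off performed by the exit code of the starvation-free solution can be isolated and traced without any concurrency. The sketch below models only that exit step: given the process i that is leaving and the waiting array, it returns which process is admitted next and the resulting value of lock. The function name exit_critical_section is hypothetical.

```python
def exit_critical_section(i, waiting):
    """Model of the exit code executed by process i.

    Scans (i+1) mod n, (i+2) mod n, ... for a waiting process.
    Returns (j, lock): the process admitted next (or None) and the
    value of lock afterwards. `waiting` is updated in place.
    """
    n = len(waiting)
    j = (i + 1) % n
    while j != i and not waiting[j]:
        j = (j + 1) % n
    if j == i:
        return None, False    # nobody is waiting: lock := false
    waiting[j] = False        # release process j from its entry loop;
    return j, True            # lock stays true, so no one else can slip in


# Process 0 is leaving while processes 2 and 3 are waiting.
waiting = [False, False, True, True]
print(exit_critical_section(0, waiting))   # process 2 is next in cyclic order
print(exit_critical_section(2, waiting))   # then process 3
print(exit_critical_section(3, waiting))   # then nobody: the lock is released
```

Because the scan always starts just after the leaving process, a waiting process is passed over at most n - 1 times, which is the bounded-waiting property the text describes.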
On a uniprocessor, instead of using test_and_set, one can merely disable interrupts.

Outline for Lecture 6
I. Semaphores
   A. P and V operations
   B. Software implement'n
   C. Hardware implement'n
   D. Deadlock & starvation
II. Classical problems
   A. Resource allocation
   B. Bounded buffer
   C. Readers & writers
   D. Dining philosophers

Semaphores (6.4 of S&G): For simple problems, test_and_set is fine, but it doesn't work well for more complicated synchronization problems. A semaphore is a more flexible synchronization mechanism.

There are two kinds of semaphores:
- Binary semaphores: Can only take on values 0 and 1.
- Counting semaphores: Can assume integer values.

Unless otherwise stated, we will be considering counting semaphores.

Two indivisible operations are defined on a semaphore s (of either kind):

The P operation, P(s):

    s := s - 1;
    if s < 0 then wait till some process performs a V(s);

    [Alternate definition for binary semaphore:
    if s > 0 then s := s - 1
    else wait till some process performs a V(s); ]

The V operation, V(s):

    s := s + 1;
    if process(es) are waiting on s
    then let one of the waiting processes proceed;

If several processes attempt a P(s) on a binary semaphore "simultaneously," only one will be able to proceed. For waiting processes, any queueing discipline may be used (e.g., FCFS).

Note: If s < 0, its value is the negative of the number of processes that are waiting.

Critical-section implementation is (even more) trivial:

    while ... do
    begin
        ... ;   { Non-critical section }
        ______
        ... ;   { Critical section }
        ______
        ... ;   { Non-critical section }
    end;

What should be the initial value of mutex?
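The counting-semaphore definition above, including the note that a negative value counts the waiting processes, can be traced with a small single-threaded model. This is a bookkeeping sketch only: it does not block real threads, processes are represented by arbitrary names, the class name ModelSemaphore is hypothetical, and FCFS is assumed as the queueing discipline.

```python
from collections import deque


class ModelSemaphore:
    """Bookkeeping model of a counting semaphore (no real blocking)."""

    def __init__(self, initial):
        self.value = initial
        self.waiting = deque()   # FCFS queue of blocked process names (assumed discipline)

    def P(self, process):
        """Return True if `process` proceeds, False if it must wait."""
        self.value -= 1
        if self.value < 0:
            self.waiting.append(process)
            return False
        return True

    def V(self):
        """Return the process that is allowed to proceed, or None."""
        self.value += 1
        if self.waiting:
            return self.waiting.popleft()
        return None


s = ModelSemaphore(2)            # e.g., two units of some resource
for name in ["a", "b", "c", "d"]:
    print(name, s.P(name), s.value)
print(s.V(), s.value)
print(s.V(), s.value)
```

After the four P operations the value is -2, matching the two entries in the queue, which illustrates the note that a negative value is the negative of the number of waiting processes.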
By executing a P operation on a semaphore sleeper_priv_sem, a process can block itself to await a certain event (a V operation on sleeper_priv_sem):

    sleeper process:              awakener process:
    { Compute }                   { Compute }
    P(sleeper_priv_sem);          V(sleeper_priv_sem);
    { Compute }                   { Compute }

What should be the initial value of sleeper_priv_sem?

Semaphores can be implemented
- in software, using busy-waiting.
- in software, using queues.
- with hardware assistance, using queues.

Here is the busy-waiting implementation:

    procedure P(s: semaphore);
    var blocked: boolean;
    begin
        blocked := true;
        while blocked do
        begin
            mutexbegin;
            if s.val > 0 then
            begin
                s.val := s.val - 1;
                blocked := false
            end;
            mutexend;
        end;
    end;

    procedure V(s: semaphore);
    begin
        mutexbegin;
        s.val := s.val + 1;
        mutexend;
    end;

Notice that this results in a haphazard queueing discipline.

If the queue implementation is used, a semaphore is declared as

    type semaphore = record
        val: integer;
        q: queue;
    end;

In overview, the code (for binary semaphores) is

    procedure P(s: semaphore);
    begin
        mutexbegin;
        if s.val > 0 then s.val := s.val - 1
        else add this process to semaphore queue and suspend it;
        mutexend;
    end;

    procedure V(s: semaphore);
    begin
        mutexbegin;
        if s.q = nil   { Empty queue }
        then s.val := s.val + 1
        else remove a process from s.q and resume it;
        mutexend;
    end;

Busy-waiting is still needed, but it is limited to the implementation of mutexbegin.

With hardware assistance, P and V would indivisibly
- perform a decrement (or increment),
- test the semaphore value and, if necessary, perform queue manipulations to add or delete a process from the semaphore queue, and then
- exit to the scheduler.

Regardless of the method used, P and V operations have to be part of the kernel because
- They must be available to all processes (without a process switch), and hence must be implemented at a low level.
- A P operation may result in a process being blocked, so it needs authority to invoke the scheduler.
- The V operation must be accessible to the interrupt routines.

Deadlocks and starvation: If two processes are waiting on semaphores, it is possible for each to be waiting for an event that can only be generated by the other. Thus, both processes will ______. They are considered to be deadlocked.

Consider a system with two processes, P0 and P1, each accessing two semaphores that are initialized to 1:

    P0              P1
    P(S);           P(Q);
    P(Q);           P(S);
      :               :
    V(S);           V(Q);
    V(Q);           V(S);

How does the deadlock occur? What order must the events occur in?

We say that a set of processes is deadlocked when each process in the set is waiting for an event that can be caused only by another process in the set.

A similar problem is starvation, where processes wait indefinitely just to acquire the semaphore. It can occur, for example, if queueing for the semaphore is LIFO (or haphazard).

Classical process coordination problems (6.5 of S&G).

Resource allocation: A counting semaphore can be used to designate how much of a resource is available.

    var priv_sem: array [0 .. n-1] of semaphore;
        mutex: semaphore;
        blocked: boolean;

    procedure request(i: process; j: ResType);
    begin
        blocked := false;
        P(mutex);
        if avail[j] = 0 then
        begin
            enqueue(i, j);   { Enqueue process i to wait for resource j }
            blocked := true;
        end
        else avail[j] := avail[j] - 1;
        V(mutex);
        if blocked then P(priv_sem[i]);
    end;

    procedure release(j: ResType);
    begin
        P(mutex);
        avail[j] := avail[j] + 1;
        if k := dequeue(j)   { I.e., if a process k is waiting for resource j }
        then begin
            avail[j] := avail[j] - 1;
            V(priv_sem[k]);
        end;
        V(mutex);
    end;

The bounded-buffer problem: Assume a buffer pool has n buffers, each capable of holding one item. Items are deposited into buffers by process producer and removed from buffers by process consumer.
Mutual exclusion must be provided for the buffer pool. Semaphores are used to
- provide mutual exclusion for the buffer (binary semaphore mutex);
- keep track of how many buffers are empty (counting semaphore); and
- keep track of how many buffers are full (counting semaphore).

    type item = ... ;
    var buffer: ... ;
        full, empty, mutex: semaphore;
        nextp, nextc: item;
    begin
        full := 0;    { 0 of buffer slots are full }
        empty := n;   { n of buffer slots are empty }
        mutex := 1;
        parbegin
            producer: while ... do
                begin
                    ...
                    produce an item in nextp;
                    ...
                    P(empty);
                    P(mutex);
                    add nextp to buffer;
                    V(mutex);
                    V(full);
                end;
            consumer: while ... do
                begin
                    P(full);
                    P(mutex);
                    remove an item from buffer to nextc;
                    V(mutex);
                    V(empty);
                    ...
                    consume the item in nextc;
                    ...
                end;
        parend;
    end;

Readers and writers: Some processes need to read [a file of] data, while other processes need to alter [write] it. Rules:
- Any number of processes may read a file at a time.
- Only one process may write the file at a time.

How many semaphores do we need to code a solution?

    var ______ : semaphore;
        nreaders: integer := 0;
    parbegin
        reader: while ... do
            begin
                { Readers enter one at a time. }
                P(rmutex);
                { 1st reader waits for readers' turn, then inhibits writers }
                if nreaders = 0 then P(wmutex);
                nreaders := nreaders + 1;
                { Allow other readers to enter or leave. }
                ______
                Read the file
                { Readers exit one at a time. }
                P(rmutex);
                nreaders := nreaders - 1;
                { Last reader unblocks writers. }
                if nreaders = 0 then ______
                { Allow reader entry/exit. }
                V(rmutex);
            end;
        writer: while ... do
            begin
                { Each writer works alone. }
                ______
            end;

Dining philosophers: a "classical" synchronization problem.
- Five philosophers spend their lives eating and thinking.
- They sit at a table with 5 chopsticks, one between each pair of philosophers.
- A bowl of spaghetti is in the center of the table.
- When a philosopher is hungry, he attempts to pick up the two chopsticks that are next to him.
- When he gets his chopsticks, he holds them until he is finished eating, then goes back to thinking.

This problem can be programmed by representing each chopstick by a semaphore.
- Grab chopstick: P semaphore
- Release chopstick: V semaphore

A philosopher is represented as a process. Here is the code.

    while true {i.e., forever} do
    begin
        P(chopstick[i]);
        P(chopstick[(i + 1) mod 5]);
        eat;
        V(chopstick[i]);
        V(chopstick[(i + 1) mod 5]);
        think;
    end;

Deadlock is possible. How?

Solutions:

Starvation is also possible. (literally!)

Outline for Lecture 7
I. Eventcounts and sequencers
II. And and or synchronization
   A. Or synchronization
   B. And synchronization
III. Critical regions
   A. Conditional critical regions
   B. The await statement

Eventcounts and sequencers: The computer version of "For Better Service, Take a Number."
- Sequencer: a numbered tag.
- Eventcount: the value of the "Now Serving" sign (starts at 0).

Customers take a ticket and then await service.

Operations on sequencers:
- ticket(S); Returns the value of the next numbered tag (starting with 0).

Operations on eventcounts:
- await(E, v); Wait until eventcount E reaches v.
- advance(E); Increment eventcount E.
- read(E); Return current value of eventcount E.

Implementation of await(E, v):

    if E < v then
    begin
        place the current process on the queue associated with E;
        resume next process;
    end;

Implementation of advance(E):

    E := E + 1;
    resume the process(es) waiting for E's value to reach current value of E;

How would we implement a critical section?

    ______
    Critical section
    ______

Let us recode readers and writers using sequencers and eventcounts.

    var Wticket, Rticket: sequencer;
        Win, Rin: eventcount;
        nreaders: integer := 0;
    parbegin
        reader: while ... do
            begin
                { Readers enter one at a time. }
                await(Rin, ticket(Rticket));
                { 1st reader waits for readers' turn, then inhibits writers }
                if nreaders = 0 then
                    await(Win, ticket(Wticket));
                nreaders := nreaders + 1;
                { Allow other readers to enter. }
                advance(Rin);
                Perform read operations
                { Readers exit one at a time. }
                await(Rin, ticket(Rticket));
                nreaders := nreaders - 1;
                { Last reader unblocks writers. }
                if nreaders = 0 then advance(Win);
                { Allow reader entry or exit. }
                advance(Rin);
            end;
        writer: while ... do
            begin
                { Each writer works alone. }
                await(Win, ticket(Wticket));
                Perform write operations.
                { Allow other writers (or a reader) to lock out. }
                advance(Win);
            end;
    parend;

- Would the program work if the first advance(Rin) were placed directly after the first await(Rin, ticket(Rticket))?
- Why is the second await(Rin, ticket(Rticket)) needed?
- Does the program make the readers exit in the same order that they entered?

And and or synchronization: Thus far, we have discussed synchronization that occurs when a process waits for some single event to occur. But in many synchronization problems, a process may be waiting for
- the occurrence of any one of a set of events before it takes some action ("or synchronization"), or
- the occurrence of all the events in a set before it proceeds to take action ("and synchronization").

Or synchronization: Suppose a device-controller process is waiting to handle the next device interrupt. It is therefore waiting for the occurrence of a single event which can come from any device in a set of devices. It does not know, however, which device will request service next. Once an event (interrupt) has occurred, the process needs to figure out which event it was. (We could use a separate process to wait for each interrupt, but this would be wasteful.)
Instead, we can list all the events and a discrimination variable as arguments to the wait:

    wait-or(e1, e2, ..., en, A);

where ei denotes a semaphore (with associated event) and A returns which event(s) occurred. The device-controller process would then use the wait-or in the following way:

    device-controller:
      begin
        while ... do
          begin
            wait-or(e1, e2, ..., en, A);
            case A of
              e1: Perform action 1;
              e2: Perform action 2;
               :
              en: Perform action n;
            endcase;
          endloop;
      end;

A naive implementation would place the waiting process on the event queue of each semaphore, then remove it from all the queues when the process is awakened.

[Figure: the device-controller process (d-c) is placed on the event queues of e1, e2, ..., en; when one event occurs, d-c is awakened from that queue and removed from the others.]

A better implementation combines all the event variables and the discrimination parameter A into a single event variable. The new variable e stands for e1, e2, ..., en, and A. The process is then placed on a single queue associated with all the events. Then, when an event occurs, the process is removed from this common queue. It then tests the variable e to determine what event has occurred, and takes the appropriate action.

[Figure: d-c waits on a single common queue; when some event in e1, e2, ..., en occurs, d-c is awakened from the common queue.]

Here is the code:

    device-controller:
      begin
        while ... do
          begin
            wait(e);
            case e of
              e1: Perform action 1;
              e2: Perform action 2;
               :
              en: Perform action n;
            endcase;
          endloop;
      end;

(Of course, this may require checking more than one event queue when an event occurs.)

And synchronization: Suppose a process needs two files at the same time, or three I/O devices. It must wait until all are available.
The naive implementation is prone to deadlock:

    Process A          Process B
    P(Dmutex);         P(Emutex);
    P(Emutex);         P(Dmutex);

What happens if both processes execute each statement at the same time (in lock step)?

As we will see later, several strategies can be used to avoid deadlock. One simple strategy appropriate for semaphores is to acquire all the needed semaphores in a single indivisible operation.

The simultaneous P, or SP, operation is specified as follows:

    SP(s1, s2, ..., sn):
      mutexbegin
        if s1 > 0 and s2 > 0 and ... and sn > 0 then
          for i := 1 to n do si := si - 1
        else
          begin
            Place the process in the queue of the first si that is <= 0;
            Set program counter to reexecute entire SP operation when process is reawakened.
          end;
      mutexend;

The SV operation is defined similarly:

    SV(s1, s2, ..., sn):
      for i := 1 to n do
        begin
          mutexbegin
            si := si + 1;
          mutexend;
          if process(es) are waiting on si then
            let one of the waiting processes proceed.
        end;

Using SP and SV, it is easy to implement dining philosophers:

    while true {i.e., forever} do
      begin
        SP(chopstick[i], chopstick[(i + 1) mod 5]);
        eat;
        SV(chopstick[i], chopstick[(i + 1) mod 5]);
        think;
      end;

Language constructs (S&G, 6.6): Although semaphores solve the critical-section problem concisely and efficiently, they are not a panacea. It is easy to make programming mistakes using semaphores.

What are some of the mistakes cited by S&G?

Perhaps higher-level mechanisms built into programming languages could help avoid these problems.

Critical regions: Suppose our program has a shared variable v that can only be manipulated by one process at a time. Then if a language provides critical regions, the compiler will allow a program to access v only within a critical region.

The variable v can be declared

    var v: shared T;

A variable declared like this can be used only within a region statement:

    region v do S;

This means that while statement S is being executed, no other process can access the variable v.

[Critical regions can be implemented easily by a compiler.
For each declaration like the one above, the compiler generates a semaphore v-mutex initialized to 1, and generates this code for each region statement:

    P(v-mutex);
    S;
    V(v-mutex); ]

We can think of a process P1 that is executing in a region v as "holding" variable v so that no one else can access it. Suppose it tries to access another shared variable w while it is in that region (v's region). In order to access this shared variable w, what kind of statement would it have to execute?

Suppose process P1 does this while another process is doing the reverse: holding variable w and trying to access v. What happens? This code brings this about:

    P1: region v do region w do S1;
    P2: region w do region v do S2;

A compiler should be able to check for this situation and issue a diagnostic.

To solve some synchronization problems, it is necessary to use conditional critical regions: critical regions that are entered only if a certain expression B is true:

    region v when B do S;

Suppose a process executes this statement. When it enters the critical region,

• If B is true, it executes statement S.
• If B is false, it releases mutual exclusion and waits until B becomes true and no other process is in the critical region.

Let's recode the bounded-buffer problem using conditional critical regions:

    var buffer: shared record
          pool: array [0 .. n-1] of item;
          count, in, out: integer;
        end;

    parbegin { Means "do the following in parallel." }
      producer:
        nextp: item;
        while ... do
          begin
            produce an item in nextp;
            region buffer when count < n do
              begin
                pool[in] := nextp;
                in := (in + 1) mod n;
                count := count + 1;
              end;
          end;
      consumer:
        nextc: item;
        while ... do
          begin
            region buffer when count > 0 do
              begin
                nextc := pool[out];
                out := (out + 1) mod n;
                count := count - 1;
              end;
            consume the item in nextc;
          end;
    parend;

Why is a conditional critical region used for the bounded-buffer problem?

The await statement: The when clause allows a process to wait only at the beginning of a critical region.
Sometimes a process needs to wait in the middle of the region. For that, an await statement can be used:

    region v do
      begin
        S1;
        await(B);
        S2;
      end;

As before, if a process executes this statement,

• If B is true, it executes statement S2.
• If B is false, it releases mutual exclusion and waits until B becomes true and no other process is in the critical region. Then it continues where it left off, at statement S2.

We can use await statements to code a solution to the readers-and-writers problem.

Our semaphore solution to readers-and-writers was a reader-preference solution, since if a reader and a writer were both waiting, the reader went ahead.

The solution we will code now is a writer-preference solution.

    var v: shared record
          nreaders, nwriters: integer;
          busy: boolean;
        end;

    parbegin
      reader:
        region v do
          begin
            await(nwriters = 0);
            nreaders := nreaders + 1;
          end;
        read file
        region v do
          begin
            nreaders := nreaders - 1;
          end;
      writer:
        region v do
          begin
            nwriters := nwriters + 1;
            await((not busy) and (nreaders = 0));
            busy := true;
          end;
        write file
        region v do
          begin
            nwriters := nwriters - 1;
            busy := false;
          end;
    parend;

Although critical regions avoid low-level programming problems like forgetting to release a semaphore, they can't assure that a resource is used correctly. For example, other code could write a file without bothering to check if someone is reading it at the same time. Or, a process could release a file it hadn't acquired yet.

Outline for Lecture 8
  I. Monitors
     A. Wait and Signal operations
     B. Bounded buffer
     C. Dining philosophers
     D. Priority wait
     E. LOOK disk scheduler
  II. Path expressions
     A. Bounded buffer

Monitors (S&G 6.7): According to the object model, each object has a certain number of operations defined on it. For example, a stack object has (at least) two operations: push and pop.

Some programming languages (e.g., Modula-2, Ada) allow this to be programmed (approximately) as follows:

    type stack = class
      { Declarations local to stack }
      var elts: array [1 .. 100] of integer;
          ptr: 0 .. 100;

      procedure entry push(x: integer);
      begin
        ptr := ptr + 1;
        elts[ptr] := x;
      end;

      function entry pop: integer;
      begin
        pop := elts[ptr];
        ptr := ptr - 1;
      end;

    begin { Initialization code }
      ptr := 0;
    end stack;

A class is a special data type, which has operations associated with it.

Operations denoted as entry operations can be invoked from outside by any process that has a variable of the type (e.g., stack). Operations not marked entry can only be invoked from inside the class declaration itself.

Given this declaration of the stack data type, other programs can declare stacks like this

    var s1, s2: stack;

and invoke procedures of the stack class like this:

    s1.push(45)
    x := s2.pop;

One problem with classes: how to stop two or more processes from manipulating a stack at the same time.

One problem with semaphores: code is low level and easy to get wrong. For example, if the programmer forgets to put in a corresponding V operation for some P, deadlock can result. Such errors are timing dependent, and hence hard to debug.

Both these deficiencies can be overcome by monitors. A monitor is a special kind of class that allows only one process at a time to be executing any of its procedures. (This property is guaranteed by the implementation of the monitor.)

The stack class above can be changed into a monitor by just replacing the word "class" by the word "monitor."

Wait and Signal operations: Inside a monitor, variables of type condition may be declared.

Operations on condition variables:

• wait on a condition variable suspends the process and causes it to leave the monitor.
• signal on a condition variable causes resumption of exactly one of the waiting processes. (If there are no waiting processes, it has no effect.)

Condition variables are private to a particular monitor; they may not be used outside of it.
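As an illustration only, the stack monitor with one condition variable might be approximated in Python as follows. This is a sketch, not the notation used in these notes: the class name MonitorStack and the condition name nonempty are hypothetical, a single lock stands in for the monitor's mutual exclusion, and Python's threading.Condition lets the signaler keep running (the awakened thread must reacquire the lock), so the waiter re-tests its condition in a loop.

```python
import threading


class MonitorStack:
    """Monitor-style stack: one lock guards every entry procedure."""

    def __init__(self):
        self._lock = threading.Lock()                     # monitor mutual exclusion
        self._nonempty = threading.Condition(self._lock)  # condition variable
        self._elts = []

    def push(self, x):
        with self._lock:               # enter the monitor
            self._elts.append(x)
            self._nonempty.notify()    # "signal": wakes at most one waiter

    def pop(self):
        with self._lock:
            while not self._elts:      # re-test after every wake-up
                self._nonempty.wait()  # "wait": releases the lock while asleep
            return self._elts.pop()


def demo():
    s = MonitorStack()
    results = []
    consumer = threading.Thread(target=lambda: results.append(s.pop()))
    consumer.start()           # may block inside pop() until a value is pushed
    s.push(45)
    consumer.join(timeout=5)
    return results
```

Calling demo() returns the single popped value in a list whether the consumer thread reaches pop() before or after the push.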
Scenario:

• Process P attempts to enter monitor M by invoking one of its operations.
• P may wait in M's entry queue.
• After P is allowed to enter M, it has exclusive access to all of M.
• Exclusive access ends when
   - P exits normally from the monitor because its operation is completed, or
   - P waits on a condition variable (in this case, P is blocked). Exclusive access is regained after the event waited on is signaled. Meanwhile, the signaling process leaves the monitor.

A monitor has these queues:

• the entry queue
• one queue associated with each condition variable
• the queue for signaling processes.

A simple monitor (merely implements a binary semaphore):

    type semaphore = monitor
      var busy: boolean;
          nonbusy: condition;

      procedure entry P;
      begin
        if busy then nonbusy.wait;
        busy := true;
      end;

      procedure entry V;
      begin
        busy := false;
        nonbusy.signal;
      end;

    begin { Initialization code }
      busy := false
    end;

A semaphore is declared as

    var sem: semaphore;

and operated on by

    sem.P;    or    sem.V;

Example 1: The bounded-buffer problem.

    type bounded_buffer = monitor
      var buffer: array [0 .. N-1] of portion;
          lastpointer: 0 .. N-1;
          count: 0 .. N;
          nonempty, nonfull: condition;

      procedure entry append(x: portion);
      begin
        if count = N then nonfull.wait;
        { 0 <= count < N }
        buffer[lastpointer] := x;
        lastpointer := (lastpointer + 1) mod N;
        count := count + 1;
        nonempty.signal
      end;

      function entry remove: portion;
      begin
        if count = 0 then nonempty.wait;
        { 0 < count <= N }
        remove := buffer[(lastpointer - count) mod N];
        count := count - 1;
        nonfull.signal;
      end;

    begin { Initialization code }
      count := 0;
      lastpointer := 0;
    end;

Example 2: Dining philosophers. Here is a deadlock-free solution to the dining-philosophers problem with monitors.

    var state: array [0 .. 4] of (thinking, hungry, eating);

Philosopher i sets state[i] = eating only if his neighbors are not eating.
    var self: array [0 .. 4] of condition;

Philosopher i delays himself on self[i] when he is hungry but unable to obtain two forks.

    type dining_philosophers = monitor
      var state: array [0 .. 4] of (thinking, hungry, eating);
          self: array [0 .. 4] of condition;

      procedure entry pickup(i: 0 .. 4);
      begin
        state[i] := hungry;
        test(i);
        if state[i] <> eating then self[i].wait;
      end;

      procedure entry putdown(i: 0 .. 4);
      begin
        state[i] := thinking;
        test((i - 1) mod 5);
        test((i + 1) mod 5);
      end;

      procedure test(k: 0 .. 4);
      begin
        if state[(k - 1) mod 5] <> eating
           and state[k] = hungry
           and state[(k + 1) mod 5] <> eating then
          begin
            state[k] := eating;
            self[k].signal;
          end;
      end;

    begin
      for i := 0 to 4 do state[i] := thinking;
    end.

An instance of dining_philosophers can be declared by

    var dp: dining_philosophers;

Here is the code for a process for one of the philosophers:

    process Pi: { philosopher i }
    begin
      ...
      dp.pickup(i);
      ... { eat }
      dp.putdown(i);
      ...
    end;

The nested monitor-call problem: A process in one monitor M1 can call a procedure in another monitor M2. The mutual exclusion in M1 is not released while M2 is executing.

This has two implications:

• Any process calling M1 will be blocked outside M1 during this period.
• If the process enters a condition queue in M2, deadlock may occur.

[Figure: a process inside monitor M1 calling into monitor M2.]

Conditional waiting in monitors: If several processes are suspended on condition x and an x.signal operation is executed, which process is resumed next?

Sometimes processes should be resumed in priority order. E.g., if SJN (shortest-job-next) scheduling is used, when the processor becomes free, the shortest job should be resumed next.

The priority-wait (conditional-wait) construct: Associates an integer value with a wait statement

    x.wait(p)

• p is an integer called the priority.
• When x.signal is executed, the process associated with the smallest priority is resumed next.
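To make the priority-wait rule concrete, here is a minimal Python sketch under stated assumptions. The class PriorityCondition, the helper waiting(), and the demo are hypothetical names introduced for illustration; the monitor's mutual exclusion is modeled by an ordinary lock that callers must hold, and a resumed thread simply reacquires that lock rather than being handed the monitor directly.

```python
import heapq
import itertools
import threading
import time


class PriorityCondition:
    """Condition variable whose signal() resumes the smallest-priority waiter."""

    def __init__(self, monitor_lock):
        self._lock = monitor_lock          # callers hold this lock (the "monitor")
        self._waiters = []                 # heap of (priority, arrival, gate)
        self._arrival = itertools.count()  # tie-breaker; gates are never compared

    def wait(self, priority):
        gate = threading.Event()
        heapq.heappush(self._waiters, (priority, next(self._arrival), gate))
        self._lock.release()               # leave the monitor while blocked
        gate.wait()
        self._lock.acquire()               # re-enter the monitor

    def signal(self):
        if self._waiters:                  # no effect if nobody is waiting
            _, _, gate = heapq.heappop(self._waiters)
            gate.set()

    def waiting(self):
        return len(self._waiters)


def _poll(lock, predicate):
    """Test helper: check predicate under the lock until it holds."""
    while True:
        with lock:
            if predicate():
                return
        time.sleep(0.001)


def demo():
    lock = threading.Lock()
    x = PriorityCondition(lock)
    resumed = []

    def job(priority):
        with lock:
            x.wait(priority)
            resumed.append(priority)

    threads = [threading.Thread(target=job, args=(p,)) for p in (30, 10, 20)]
    for t in threads:
        t.start()
    _poll(lock, lambda: x.waiting() == 3)   # all three are queued
    for n in (1, 2, 3):
        with lock:
            x.signal()
        _poll(lock, lambda: len(resumed) == n)
    for t in threads:
        t.join()
    return resumed
```

In demo(), three threads wait with priorities 30, 10, and 20, and three successive signals resume them in increasing priority order regardless of the order in which they arrived.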
The empty construct: x.empty is a boolean function that tests whether any processes are waiting on condition variable x.

Example: A monitor for the LOOK algorithm. The LOOK algorithm keeps the head moving in the same direction until there are no more requests to service in that direction.

[Figure: pending requests and the head position.]

If another request comes in ahead of the head, it will be serviced on the current sweep.

[Figure: a new request ahead of the head.]

If another request comes in behind the head, it will wait until the next sweep (the disk head will have to go to the other end and turn around before reaching the new request).

[Figure: a new request behind the head.]

    type disk_head = monitor
      var head_pos: 1 .. max_cyl_number;
          busy: boolean;
          direction: (up, down);
          downsweep, upsweep: condition;

      procedure entry acquire(dest_cyl: 1 .. max_cyl_number);
      begin
        if busy then
          if head_pos < dest_cyl
             or (head_pos = dest_cyl and direction = down) then
            upsweep.wait(dest_cyl)
          else { if direction is down }
            downsweep.wait(max_cyl_number - dest_cyl);
        busy := true;
        { Issue command to move disk arm. }
        head_pos := dest_cyl;
      end;

      procedure entry release;
      begin
        busy := false;
        if direction = up then
          if upsweep.empty then
            begin
              direction := down;
              downsweep.signal;
            end
          else upsweep.signal
        else
          if downsweep.empty then
            begin
              direction := up;
              upsweep.signal;
            end
          else downsweep.signal
      end;

    begin
      head_pos := 1;
      direction := up;
      busy := false;
    end;

If da is an instance of the disk_head monitor, a process that wants to use the disk executes code of this form:

    da.acquire(i);
      :
    { Access cylinder i }
      :
    da.release;

Notes on the program:

• We need two condition variables to specify priority.
• Without priority wait, we would need one condition variable per cylinder.
• release signals a request in the current direction. If there is none, the direction is reversed.
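The queueing rule in acquire and the signaling rule in release can be traced with a small sequential Python sketch. It is an illustration under simplifying assumptions, not part of the monitor: the function names classify and service_order are hypothetical, every request is assumed to arrive while the disk is busy at one fixed head position, and no new requests arrive afterward.

```python
import heapq


def classify(head_pos, direction, dest_cyl, max_cyl):
    """Return (queue name, priority) using the same test as acquire()."""
    if head_pos < dest_cyl or (head_pos == dest_cyl and direction == "down"):
        return "upsweep", dest_cyl            # nearer cylinders first going up
    return "downsweep", max_cyl - dest_cyl    # higher cylinders first going down


def service_order(head_pos, direction, requests, max_cyl):
    """Order in which release() would resume the queued requests."""
    queues = {"upsweep": [], "downsweep": []}
    for cyl in requests:
        name, priority = classify(head_pos, direction, cyl, max_cyl)
        heapq.heappush(queues[name], (priority, cyl))

    order = []
    while queues["upsweep"] or queues["downsweep"]:
        current = "upsweep" if direction == "up" else "downsweep"
        if not queues[current]:               # nothing left in this direction
            direction = "down" if direction == "up" else "up"
            current = "upsweep" if direction == "up" else "downsweep"
        _, cyl = heapq.heappop(queues[current])   # smallest priority first
        order.append(cyl)
    return order
```

For example, with the head at cylinder 100 moving up and 300 cylinders, service_order(100, "up", [150, 50, 120, 80, 200], 300) services 120, 150, and 200 on the upsweep, then 80 and 50 on the way back down.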
• Which condition variable does a would-be acquirer wait on
   - if the head position is below the desired cylinder?
   - if the head position is above the desired cylinder?
• Does it make a difference which direction the head is moving?
• What priorities are specified if
   - the head is at cylinder 25 and processes request cylinders 100 and 200?
   - the head is at cylinder 250 and requests arrive for cylinders 50 and 150?
• Why isn't the head_pos test simply "if head_pos < dest_cyl"?

How can we be sure that processes use this monitor correctly?

• Only if processes always make calls on the monitor in a correct sequence, and
• only if no uncooperative process accesses the disk without using the monitor.

For large systems, access-control methods are needed.

Path expressions: Monitors are compatible with the object model, because they define the operations on a resource in the same place as the resource itself. But synchronization using wait and signal tends to be scattered throughout the monitor's procedures. This impairs the readability of the code.

To attack this problem, path expressions express order of execution in a single line.

Path expressions are used within objects that are similar to monitors.

• Procedures within objects are not automatically mutually exclusive.
• All execution constraints are declared at the beginning of the object, using a path expression.

Its syntax is

    path restriction expression end

A restriction expression is defined as follows.

1. A procedure name P is a restriction expression; a single procedure name implies no restriction.

2. If P1 and P2 are restriction expressions, then each of the following is also a restriction expression:

P1, P2 denotes concurrent execution. No restriction is placed on the order in which P1 and P2 are invoked, or on the number of concurrent executions (i.e., any number of each can execute concurrently).

P1; P2 denotes sequential execution.
A separate invocation of P1 must complete before each invocation of P2. The execution of P2 does not inhibit the execution of P1; many different invocations of P1 and P2 may be active concurrently, as long as the number of P2s that have begun execution is <= the number of P1s that have completed.

n: (P1) denotes resource restriction. It allows at most n separate invocations of P1 to execute simultaneously.

[P1] denotes resource derestriction. It allows an arbitrary number of invocations of P1 to coexist simultaneously.

Examples:

1. path 1: (P1) end
   Procedure P1 must be executed sequentially; only one invocation may be active at a time.

2. path 1: (P1), P2 end
   Multiple invocations of P1 result in sequential execution, while no restriction is placed on P2 (i.e., any number of invocations of P2 are allowed).

3. path 1: (P1), 1: (P2) end
   A maximum of one P1 and one P2 can execute concurrently.

4. path 6: (5: (P1), 4: (P2)) end
   As many as five invocations of P1 and four invocations of P2 can proceed concurrently, as long as the overall limit of six invocations is not exceeded.

5. path 5: (P1; P2) end
   Each invocation of P2 must be preceded by a complete execution of P1; at most five invocations of P1 followed by P2 may proceed concurrently. (I.e., P1 can get ahead of P2 by at most five invocations.)

6. path 1: ([P1], [P2]) end
   Procedures P1 and P2 operate in mutual exclusion, but each can concurrently execute as many times as desired. If all executions of one procedure have finished, either procedure may start again.

What would be the path expression for the weak reader-preference solution to the reader/writer problem?

    path ________

The bounded-buffer problem with path expressions:

    type bounded_buffer = object
      path N: (append; remove), 1: (append, remove) end

      var buffer: array [0 .. N-1] of portion;
          lastpointer: 0 .. N-1;
          count: 0 .. N;

      procedure entry append(x: portion);
      begin
        buffer[lastpointer] := x;
        lastpointer := (lastpointer + 1) mod N;
        count := count + 1;
      end;

      function entry remove: portion;
      begin
        remove := buffer[(lastpointer - count) mod N];
        count := count - 1;
      end;

    begin
      count := 0;
      lastpointer := 0;
    end;

Path expressions can easily be implemented using semaphores.

Outline for Lecture 9
  I. CSP
     A. The language
     B. Bounded buffer
     C. Dining philosophers
  II. Ada tasking
     A. Basic concepts
     B. select statement
     C. Bounded buffer
     D. else clauses
  III. Synchronization in Solaris 2

Synchronization in high-level languages: Several high-level languages provide constructs for creating and synchronizing parallel processes.

CSP (Communicating Sequential Processes), C. A. R. Hoare, CACM, August 1978. A language for expressing parallelism.

Premises:

• Input and output are basic primitives of programming.
• Parallel composition of communicating sequential processes is a fundamental program-structuring method.

Communication occurs when one process names another as a destination for output, and the second process names the first as the source for input. Then the output value is copied to input, without automatic buffering.

The language:

Guards.

• A guard is a boolean expression.
• A guarded command is a guard, together with a list of statements which are executed only if the guard does not fail.

    guarded command ::= guard -> statements

• A guarded command set is a list of alternative guarded commands. As the list is scanned, an arbitrary command whose guard does not fail will be executed.

    guarded command set ::= guarded command [] ... [] guarded command

Semantics of alternative commands.

• All guards execute concurrently. (The statements within a guard are executed left-to-right.)
• If all guards fail, the alternative command fails.
• For performance reasons, the system might select the earliest completed guard.
The statement list associated with the selected guard is executed.
• Execution of all other guards is discontinued.
• A nondeterministic choice is made between two simultaneous guards.

Syntax for input/output commands.

    input command ::= source ? target variable
    output command ::= destination ! expression

source and destination are processes that are executing concurrently.

Conditions for communication.

• One process names the other as source for input.
• The other process names the first as destination for output.
• The type of the target variable matches the type of the output variable.

Then the two statements execute concurrently, assigning the output value to the target variable.

An I/O command fails when

• the source or destination process is terminated;
• its expression is undefined; or
• the type of the target variable of an input command is not the same as the type of the value denoted by the expression of the output command.

Examples:

    card_reader ? card_image
    line_printer ! line_image

Parallel commands. Commands that are to execute in parallel are surrounded by brackets and separated by ||'s:

    [ process || ... || process ]

Here are some examples of parallel commands:

    [ card_reader ? card_image || line_printer ! line_image ]

    [ room :: ROOM || fork(i: 0 .. 4) :: FORK || phil(i: 0 .. 4) :: PHIL ]

• All of the parallel commands start simultaneously.
• The parallel command terminates when all of its commands have terminated.
• Each process must be disjoint. This means that
   - it doesn't assign to non-local variables mentioned in the others; and
   - it doesn't use non-local process names used by the others.

Repetition is specified by an *:

    * alternative command

For example, a linear search through an n-element array can be coded like this:

    i := 0; *[i < n; content(i) ≠ key -> i := i + 1]

Iteration occurs as long as one of the alternatives succeeds. (In this example, there was only one alternative.)

The bounded-buffer problem in CSP: Below is an implementation of a bounded-buffer manager.

The consumer process asks for the next portion and then attempts to input it:

    X!more( ); X?p

The producer provides a buffer portion via

    X!p

    X ::
      buffer: (0 .. 9) portion;
      in, out: integer; in := 0; out := 0;
      comment 0 <= out <= in <= out + 10;
      *[ in < out + 10; producer?buffer(in mod 10) -> in := in + 1
      [] out < in; consumer?more( ) -> consumer!buffer(out mod 10); out := out + 1
      ]

Notes:

• When out < in < out + 10, the selection of the alternative depends on whether producer produces first or consumer consumes first.
• When out = in, the buffer is empty, and the second alternative can't be selected even if consumer is waiting.
• Producer can't go ahead when in = out + 10.
• X terminates when out = in and producer has terminated.
• Strictly speaking, more( ) should be a variable. It doesn't have to have the same name as in the sending process.

Dining philosophers in CSP: There are eleven processes:

• One for the room.
• Five for the philosophers.
• Five for the forks.

    PHIL = *[ THINK;
              room!enter( );
              fork(i)!pickup( );
              fork((i + 1) mod 5)!pickup( );
              EAT;
              fork(i)!putdown( );
              fork((i + 1) mod 5)!putdown( );
              room!exit( );
            ]

    FORK = *[ phil(i)?pickup( ) -> phil(i)?putdown( )
           [] phil((i - 1) mod 5)?pickup( ) -> phil((i - 1) mod 5)?putdown( )
            ]

    ROOM = occupancy: integer; occupancy := 0;
           *[ (i: 0 .. 4) phil(i)?enter( ) -> occupancy := occupancy + 1
           [] (i: 0 .. 4) phil(i)?exit( ) -> occupancy := occupancy - 1
            ]

    comment All these processes operate in parallel.

    [ room :: ROOM || fork(i: 0 .. 4) :: FORK || phil(i: 0 .. 4) :: PHIL ]

This solution does not avoid starvation or deadlock.

Ada tasks: A task provides several types of services, each designated by an entry name. A service request is made by an entry call. Only one entry can be executed at a time.
Thus, every entry in a task behaves like a critical section with respect to itself and all other entries of the task in which it is defined.

A task consists of two parts:

• The specification of the task.

    task t is
      entry declarations
    end t;

• The task body.

    task body t is
      declarations
    begin
      statements
    end t;

Different tasks proceed in parallel, except at points where they synchronize.

Tasks synchronize when they arrive at the same rendezvous point.

• One task calls an entry of the other task.
• The other task executes an accept statement.

Either action may occur first.

Calling task:

    task t1 is
      ...
    end t1;

    task body t1 is
      ...
      t2.e(...);
      ...
    end t1;

Called task:

    task t2 is
      entry e(...);
    end t2;

    task body t2 is
      ...
      accept e(...) do
        ...
      end e;
      ...
    end t2;

• Whichever task first arrives at the rendezvous point waits for the other.
• When both tasks are there, the accept statement is executed.
• Thereafter, both tasks continue in parallel.

A sequence of accept statements may be used to force the calling task to request actions in a particular order. For example,

    accept e1(...) do [body of e1] end e1;
    accept e2(...) do [body of e2] end e2;
    accept e3(...) do [body of e3] end e3;

specifies that entry e1 must be the first executed, followed by no other entry than e2, which in turn is followed by an execution of e3.

Often, a sequence of accept statements may be embedded in a loop so that they can be processed repeatedly:

    while ... loop
      accept e1(...) do ... end e1;
      accept e2(...) do ... end e2;
      accept e3(...) do ... end e3;
    end loop;

In a file system, for example, a file might be operated on by a file-open, followed by a sequence of read or write requests.

The select statement: A select statement allows for a choice between entries. In format, it is similar to a case statement in other languages (but its effect is different).
Some of the alternatives in a select statement may be qualified: eligible to be executed only when a boolean expression is true.

    select
      accept e1(...) do ... end;
    or
      when ... => accept e2(...) do ... end;
    or
      ...
    end select;

An alternative is open (eligible to be executed) if there is no when condition, or if the condition is true.

An open alternative can be selected if a corresponding rendezvous is possible. If several alternatives can thus be selected, one of them is chosen at random.

If no rendezvous is immediately possible, the task waits until an open alternative can be selected.

The implementation of a bounded buffer below uses a when clause to prevent

• inserting an item in a full buffer, or
• removing an item from an empty buffer.

    task bounded_buffer is
      entry insert(it: in item);
      entry remove(it: out item);
    end bounded_buffer;

    task body bounded_buffer is
      buffer: array (0 .. 9) of item;
      in, out: integer := 0;
      count: integer range 0 .. 10 := 0;
    begin
      loop
        select
          when count < 10 =>
            accept insert(it: in item) do
              buffer(in mod 10) := it;
            end;
            in := in + 1;
            count := count + 1;
        or
          when count > 0 =>
            accept remove(it: out item) do
              it := buffer(out mod 10);
            end;
            out := out + 1;
            count := count - 1;
        end select;
      end loop;
    end bounded_buffer;

In this code, the "in := in + 1; count := count + 1" could be placed inside the do ... end. What difference would it make?

An else clause can be added to a select statement in case no other alternative can be executed:

    select
      ...
    or
      ...
    or
      ...
    else
      statements
    end select;

The else part is executed if no alternative can be immediately selected.
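The selection rule just described can be summarized in a short Python sketch. This is only a model of the decision, not Ada and not a tasking implementation: the function choose, the tuple layout, and the returned strings are hypothetical, and each alternative is reduced to two booleans (is its guard true, and is a caller already waiting on its entry). The error branch reflects Ada's behavior when every alternative is closed and there is no else part.

```python
import random


def choose(alternatives, has_else=False, rng=random):
    """Decide what a select statement does right now.

    alternatives: list of (entry_name, guard_is_true, caller_is_waiting).
    Returns an entry name, "else", or "wait".
    """
    open_alts = [alt for alt in alternatives if alt[1]]
    ready = [name for name, _, waiting in open_alts if waiting]
    if ready:
        return rng.choice(ready)   # several possible: pick one arbitrarily
    if has_else:
        return "else"              # nothing can be selected immediately
    if not open_alts:
        # All guards false and no else part: Ada raises an error here.
        raise RuntimeError("no open alternative and no else part")
    return "wait"                  # block until an open alternative is callable


def buffer_alternatives(count, insert_waiting, remove_waiting, size=10):
    """Alternatives of the bounded-buffer task for a given buffer state."""
    return [
        ("insert", count < size, insert_waiting),
        ("remove", count > 0, remove_waiting),
    ]
```

With an empty buffer and only a remove caller queued, choose(buffer_alternatives(0, False, True)) yields "wait", because the remove alternative is closed and no insert call is pending; with has_else=True the same state yields "else".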
Here is a flowchart of how an entry is chosen:

    select
      open alternatives
        possible rendezvous: select one rendezvous
        no possible rendezvous
          with else: execute else
          without else: wait
      no open alternatives
        with else: execute else
        without else: raise error

The delay statement: Allows a program to wait to see if an alternative will become open within a specified period of time.

    delay duration

An open alternative beginning with a delay statement will be executed if no other alternative has been selected before the specified duration has elapsed. Any subsequent statements of the alternative are then executed:

    select
      accept insert(...)
    or
      accept remove(...)
    or
      delay 10;
    end select;

The calls on insert and remove are canceled if a rendezvous is not started within 10 seconds.

Delay statements and else clauses cannot both be used.

Synchronization in Solaris 2 (S&G, 6.8): Before Solaris 2, SunOS used critical sections to guard multiply accessed data structures. Interrupt level was set high enough so that no interrupt could cause modification to the critical data.

Solaris 2

• is multi-threaded,
• provides real-time capabilities, and
• supports multiprocessors.

Why wouldn't the approach of raising the interrupt level work anymore?

Even if critical sections could have been used, they would have caused a large performance degradation.

For synchronization, Solaris 2 uses three mechanisms:

• Adaptive mutexes.
• Condition variables.
• Readers-writers locks.

Adaptive mutexes adapt their behavior, depending on how long they expect to have to wait.

At the outset, an adaptive mutex is similar to a spinlock. But it behaves differently according to what conditions prevail.

    Data locked?
      No: Set lock and access data.
      Yes: Locked by thread that is running?
        Yes: Wait for lock to become free.
        No: Go to sleep until lock is released.

Why the difference in behavior depending on whether the thread is running?
How would the behavior be different on a uniprocessor system?

Condition variables: Under what condition would it be inefficient to use an adaptive mutex?

With a condition variable, if the desired data is locked, the thread goes to sleep. When a thread frees the lock, it signals the next sleeping thread.

In order for this to be worthwhile,

    cost of putting a thread to sleep and waking it + cost of context switches

must be < cost of wasting several hundred instructions in a spinlock.

Readers-writers locks: Under what condition are both adaptive mutexes and condition variables inefficient?

Outline for Lecture 10
I. Atomic transactions
    A. Log-based recovery
    B. Checkpoints
    C. Serializability & concurrent transactions
    D. Locking protocol
    E. Timestamp-based protocols

Atomic transactions (S&G, 6.9): Sometimes it is necessary to make sure that a critical section performs a single logical unit of work.

That is, if some of the critical section executes, then ______

This problem has been around for a long time in databases.

A collection of instructions that performs a single logical function is called a transaction. A transaction is a sequence of read and/or write operations, terminated by a commit or an abort.

If a transaction aborts, its effects must be completely undone. But the effect of a transaction that has committed cannot be undone by an abort operation.

If a transaction aborts, the system state must be restored to ______

Log-based recovery: Atomicity can be ensured by recording on stable storage all actions performed by the transaction.

The most widely used method for this is write-ahead logging. Before storage is modified, the following data is written to stable storage:
- Transaction name.
- Data-item name.
- Old value of data item.
- New value of data item.

Special log records record actions such as starting a transaction ______ of a transaction.
In general, a transaction record looks like this:

    <Ti starts>
    <Ti writes, nj, vj, vj'>
        :
    <Ti commits>

What should happen first? Updating a data item, or writing its log record out to stable storage?

This imposes a certain overhead.

The recovery algorithm uses two procedures:

    undo(Ti)
    redo(Ti)

Undo and redo must be idempotent (multiple executions have the same effect as one execution).

If a transaction Ti aborts, then we can restore the state of the data that it has updated by executing ______

If a system failure occurs, which transactions need to be undone, and which need to be redone?

Transaction Ti needs to be undone if the log contains the record ______ but does not contain the record ______

Transaction Ti needs to be redone if the log contains the record ______ and contains the record ______

Checkpoints (S&G, 6.9.3): Obviously, we don't want to have to search the whole log whenever there is a failure.

It is better if the system periodically performs checkpoints. At a checkpoint, the system
- outputs all log records currently residing in volatile storage,
- outputs all modified data residing in volatile storage, and
- outputs a log record <checkpoint>.

What is the recovery procedure? There are three kinds of transactions to consider.

1. Transactions that start and commit before the most recent checkpoint.
2. Transactions that start before this checkpoint, but commit after it.
3. Transactions that start and commit after the most recent checkpoint.

Which of these transactions need to be considered during recovery?

Do all such transactions need to be undone or redone?

Once transaction Ti has been identified, the redo and undo operations need to be applied only to Ti and the transactions that started after it.

Let's call these transactions the set T.
The recovery algorithm is this:

For all transactions Tk in T such that the record <Tk commits> appears in the log, execute ______

For all transactions Tk in T that have no ______ in the log, execute ______

Serializability and concurrent atomic transactions (S&G, 6.9.4): Concurrent execution of atomic transactions must be equivalent to their serial execution in some order.

A schedule is called equivalent to another schedule if each schedule results in the same sequence of values being written to each location.

One way to assure serializability is simply to execute each transaction in a critical section. What's wrong with this?

We want to determine when an execution sequence is equivalent to a serial schedule.

Suppose a system has two data items, A and B, which are both read and written by two transactions, T0 and T1. A serial schedule is shown below.

    T0              T1
    read(A)
    write(A)
    read(B)
    write(B)
                    read(A)
                    write(A)
                    read(B)
                    write(B)

Given n transactions, how many valid serial schedules are there?

A non-serial schedule is formed when transactions overlap their execution.

A non-serial schedule may be equivalent to a serial schedule if there aren't any "conflicting" operations in the "wrong" places.

Two operations are said to be conflicting if they access the same data, and at least one of them is a write.

In the schedule above, what operations does the write(A) of T0 conflict with?

Given a schedule of operations, we can produce an equivalent schedule by this procedure: If two operations are consecutive, and do not conflict, then their order can be swapped to produce an equivalent schedule.

Here is a non-serial schedule.

    T0              T1
    read(A)
    write(A)
                    read(A)
                    write(A)
    read(B)
    write(B)
                    read(B)
                    write(B)

How can we transform it into a serial schedule?

If schedule S can be transformed into a serial schedule S' by a sequence of swaps of non-conflicting operations, it is said to be conflict serializable.
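The swap procedure above can be checked mechanically. The sketch below does not perform the swaps; it substitutes the standard precedence-graph test, which gives the same answer: draw an edge Ti -> Tj whenever an operation of Ti precedes a conflicting operation of Tj, and the schedule is conflict serializable exactly when that graph has no cycle. The function names, the tuple encoding of operations, and the second schedule `bad` are illustrative assumptions, not part of the notes. The first schedule assumes the interleaving T0: read(A), write(A); T1: read(A), write(A); T0: read(B), write(B); T1: read(B), write(B).

```python
from itertools import combinations


def conflicts(op1, op2):
    """Operations of different transactions conflict if they access the
    same data item and at least one of them is a write."""
    (t1, kind1, item1), (t2, kind2, item2) = op1, op2
    return t1 != t2 and item1 == item2 and "write" in (kind1, kind2)


def precedence_edges(schedule):
    """Edge (Ti, Tj): some operation of Ti precedes a conflicting one of Tj."""
    edges = set()
    for i, j in combinations(range(len(schedule)), 2):
        if conflicts(schedule[i], schedule[j]):
            edges.add((schedule[i][0], schedule[j][0]))
    return edges


def is_conflict_serializable(schedule):
    """True iff the precedence graph of the schedule is acyclic."""
    edges = precedence_edges(schedule)
    nodes = {op[0] for op in schedule}
    while nodes:
        # Transactions with no incoming edge from a remaining transaction.
        free = [n for n in nodes
                if not any(dst == n and src in nodes for src, dst in edges)]
        if not free:
            return False  # every remaining transaction follows another: cycle
        nodes -= set(free)
    return True


# Non-serial schedule from the notes (assumed interleaving, see lead-in).
non_serial = [
    ("T0", "read", "A"), ("T0", "write", "A"),
    ("T1", "read", "A"), ("T1", "write", "A"),
    ("T0", "read", "B"), ("T0", "write", "B"),
    ("T1", "read", "B"), ("T1", "write", "B"),
]

# Hypothetical schedule in which T1's write falls between T0's read and write.
bad = [("T0", "read", "A"), ("T1", "write", "A"), ("T0", "write", "A")]

print(is_conflict_serializable(non_serial))  # True
print(is_conflict_serializable(bad))         # False
```

For the first schedule the only edge is T0 -> T1, so it is equivalent to the serial schedule T0, T1.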
Locking protocol: How can serializability be ensured?

One way is to
- associate a lock with each data item,
- require that each transaction obtain a lock before it reads or writes the item, and
- assure that a certain protocol is followed in acquiring and releasing the locks.

Among the modes in which data can be locked are
- shared (S): a transaction that has obtained a shared lock on data item Q can read Q but cannot write it.
- exclusive (X): a transaction that has obtained an exclusive lock on data item Q can read and/or write it.

If a data item is already locked when a lock is requested, the lock can be granted only if the current lock ______

Notice the similarity to readers and writers!

Serializability is ensured if we require that each transaction is two phase. This means that it consists of
- a growing phase, in which the transaction may request new locks, but may not release any locks, and
- a shrinking phase, in which a transaction may release locks, but may not request any new locks.

The two-phase locking protocol ensures conflict serializability. It does not ensure freedom from deadlock.

Timestamp-based protocols: How is the order of execution determined for conflicting transactions?

With a locking protocol, it is determined by which of the two is first to request a lock that involves incompatible modes.

A timestamp-based protocol decides instead based on the order in which the transactions began.

With a timestamp-based protocol, each transaction Ti is assigned a timestamp TS(Ti) when it starts execution.

If Ti starts before Tj, then TS(Ti) < TS(Tj).

How are timestamps determined?
- By the ______
- By a ______

The system must ensure that the resulting schedule is equivalent to a serial schedule in which the transactions execute in timestamp order.
Two timestamp values are associated with each data item:
- W-timestamp(Q): the highest timestamp of any transaction that successfully executed write(Q).
- R-timestamp(Q): the highest timestamp of any transaction that successfully executed read(Q).

When does a transaction Tj need to be rolled back? Iff
- it attempts to read a value that has already been overwritten by a transaction that started later. I.e., if ______
- it attempts to write a value that has either been read or written by a transaction that started later. I.e., if ______

When are the R- or W-timestamps set to new values?

What new values are these timestamps set to?

Here is a possible schedule under the timestamp protocol (TS(T0) < TS(T1)):

    T0              T1
    read(B)
                    read(B)
                    write(B)
    read(A)
                    read(A)
                    write(A)

What change to the above schedule would make it impossible under the timestamp protocol?

    T0              T1
    read(B)
                    read(B)
                    write(B)
    read(A)
                    read(A)
                    write(A)

Outline for Lecture 11
I. Deadlock: necessary conditions
II. Resource-allocation graphs
    A. Reusable vs. consumable resources
    B. Operations on resources
    C. Suspending a process
III. Deadlock prevention

Deadlock (S&G 7.1): A situation where two or more processes are prevented from proceeding by the demands of the other(s).

Simple example:
- Process A has the printer and needs a tape drive. It can't release the printer till it acquires the tape drive.
- Process B has the tape drive and needs a printer. It can't release the tape drive till it acquires the printer.

In general, a set of processes is deadlocked if each process is "monopolizing" a resource and waiting for the release of resources held by others in the set.

Deadlock can arise in other contexts:
- Crossing a river over a path of stones wide enough for only one person.
- Traffic gridlock.

Note that deadlocked processes are not the same as blocked processes.
Necessary conditions for deadlock: A deadlock may arise only if these four conditions hold simultaneously:

(C1) Mutual exclusion. At least one resource must be held in a non-sharable mode.

(C2) No preemption. Resources cannot be preempted, but can only be released voluntarily by the processes that are using them.

(C3) Hold and wait. At least one process must be holding a resource and waiting to acquire resources that are held by other processes.

(C4) Circular wait. There must exist a set of processes {p0, p1, ..., pn} such that

    p0 is requesting a resource that is held by p1,
    p1 is requesting a resource that is held by p2,
        :
    pn-1 is requesting a resource that is held by pn, and
    pn is requesting a resource that is held by p0.

These are necessary but not sufficient conditions for deadlock to occur.

Resource-allocation graphs: Show the relationship between processes and resources. Useful in visualizing deadlock.

Definition: A resource-allocation graph is an ordered pair G = (V, E) where
- V is a set of vertices. V = P ∪ R, where
      P = {p1, p2, ..., pn} is a set of processes, and
      R = {r1, r2, ..., rm} is a set of resource types.
- E is a set of edges. Each element of E is an ordered pair (pi, rj) or (rj, pi), where pi is a process (an element of P), and rj is a resource type (an element of R).

An edge (pi, rj) from a process to a resource type indicates that process pi is requesting resource rj (a request edge).

An edge (rj, pi) from a resource type to a process indicates that resource rj is allocated to process pi (an allocation or assignment edge).

Here is an example of a resource-allocation graph.

[Figure: resource-allocation graph with processes p1, p2, p3 and resource types r1, r2, r3, r4]

In this graph,

    P = {p1, p2, p3}
    R = {r1, r2, r3, r4}
    E = {(p1, r1), (p2, r3), (r1, p2), (r2, p2), (r2, p1), (r3, p3)}

Each dot designates one instance of a resource.
- Assignment edges point from specific dots.
- Request edges do not point to specific dots.
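Whether an edge set such as E contains a cycle can be checked with an ordinary depth-first search. The sketch below is only an illustration: the function name `has_cycle` and the encoding of vertices as strings are assumptions, and the graph ignores how many units each resource type has, so it answers the question "is there a cycle?" and nothing more.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs
    contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:    # back edge: nxt is on the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Edge set E of the example graph.
E = [("p1", "r1"), ("p2", "r3"), ("r1", "p2"),
     ("r2", "p2"), ("r2", "p1"), ("r3", "p3")]

print(has_cycle(E))  # False
```

The same function can be applied again after any request edge is added to E.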
No cycles exist in this graph.

But if we add an edge (p3, r2):

[Figure: the same graph with request edge (p3, r2) added]

Now there are two cycles: ______ and ______

Is there a deadlock here?

However, a cycle does not always mean there is a deadlock. In the example at the right, there is a cycle but no deadlock, because ______

[Figure: resource-allocation graph with processes p1, p2, p3, p4 and resource types r1, r2]

All of the resources we have seen so far are called reusable resources:
- There is a fixed total inventory. Additional units are neither created nor destroyed.
- Units are requested by processes, and when they are released, are returned to the "pool" so that other processes may use them.

Some resources, however, are consumable:
- There is no fixed total number of units. Units may be created (produced) or acquired (consumed) by processes.
- An unblocked producer of the resource may create any number of units. These units then become immediately available to consumers of the resource.
- An acquired unit ceases to exist.

Examples:
    Reusable resources: ______
    Consumable resources: ______

In a resource-allocation graph, there is a third kind of edge: A producer edge (rj, pi) from a resource type to a process indicates that process pi produces units of resource rj.

Here is a resource-allocation graph with a producer edge:

[Figure: resource-allocation graph with processes p1, p2 and resource types r1, r2]

In this graph,
- p1 holds 2 units of reusable resource r1,
- p2 holds 1 unit of r1,
- p2 requests 1 unit of r2, and
- p2 is the only producer of r2.

Operations on resources:

Requests. If pi has no outstanding requests (i.e., if it is unblocked), then it may request units of any number of resources rj, rk, ...

To reflect this in the graph, add edges (pi, rj), (pi, rk), ... in numbers corresponding to the number of units of each resource requested.

Allocations. If process pi has outstanding requests, and for each requested resource rj, the number of units requested ≤ the number of units available aj, then pi may acquire all requested resources.
To reflect this in the graph, reverse the direction of each request edge to a reusable resource to make it an allocation edge, and remove each request edge to a consumable resource, simulating the consumption of units by pi.

In either case, the number of available units aj is reduced by the number of units acquired or consumed by pi.

Releases. If process pi has no outstanding requests (i.e., it is executable), and there are assignment or producer edges (rj, pi), then pi may
- release any subset of the reusable resources it holds, or
- produce any number of units of consumable resources for which it is a producer.

To reflect this in the graph, remove the assignment edges of the released reusable resources; producer edges are permanent.

Inventories aj are incremented by the number of units of each resource rj released or produced.

Example: Consider the resource-allocation graph with a producer edge shown earlier.

Suppose p1 requests one unit of r1 and two units of r2.

Then edge (p1, r1) and two edges (p1, r2) appear in the diagram.

[Figure: the graph after p1's requests]

Since not all of p1's requests are satisfiable, p1 is blocked in this state.

Now suppose p2's request for one unit of r2 is granted. What happens?

[Figure: the graph after p2's request is granted]

Now suppose p2 produces three units of r2. What happens to the diagram?

[Figure: the graph after p2 produces three units of r2]

Methods for handling deadlocks: There are basically two methods for dealing with deadlocks. Either
- never allow it to happen, or
- when it does happen, try to get out of it.

The "never allow it" strategy can be subdivided into two cases, giving three methods overall.
- Deadlock prevention. Disallow one of the four necessary conditions for a deadlock to exist.
- Deadlock detection and recovery. Try to notice when deadlock has occurred, then take action (usually drastic action!) to remove it.
- Deadlock avoidance. Rather than disallowing any of the four conditions, dynamically "sidestep" potential deadlocks.
Deadlock prevention: Must disallow one of the four necessary conditions.

(C1) Mutual-exclusion condition. Make resources sharable, e.g., read-only files. (Not always possible.)

(C2) No-preemption condition. Make resources preemptible. Two strategies:
- If a process that is holding some resources requests more resources that are not available, preempt the resources it is holding. The preempted resources are added to the list of resources it was waiting for.
- If a process requests some resources that are not available, preempt them from another waiting process if there is a waiting process. Otherwise the requesting process must wait. It may lose some of its resources, but only if another process requests them.

Problems: Overhead of preemption. Starvation.

(C3) The hold-and-wait condition. Again, two techniques can be used.
- Preallocation. A process requests all the resources it will ever need at the start of execution.
  Problems: Expense of unused resources. Lack of foreknowledge, especially by interactive processes.
- Allow a process to request resources only when it has none allocated. It must always release its resources before requesting more.
  Problem: Starvation more likely.

(C4) Circular-wait condition. Impose a linear ordering of resource types. For example, let

    F(card reader) = 1
    F(disk drive)  = 5
    F(tape drive)  = 7
    F(printer)     = 9

More formally, let R = {r1, r2, ..., rm} be the set of resource types, and N be the natural numbers. Define a one-to-one function

    F: R → N

A process must request resource types ri in increasing order of enumeration F(ri).
- A process can initially request any resource type, say ri.
- Later, it can request a resource type rj iff F(rj) > F(ri).

[Alternatively, it can be required to release all resources rk such that F(rk) ≥ F(rj).]

The proof that this prevents deadlock is easy; it follows by transitivity of <.
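The ordering rule for (C4) can be enforced by a small bookkeeping check at request time. The sketch below is an illustration under stated assumptions: the class name `OrderedRequester` and its methods are hypothetical, the check follows the bracketed alternative formulation (a request for rj is accepted only if the process currently holds no resource rk with F(rk) ≥ F(rj)), and the F values are the ones from the example above.

```python
# Ordering function F from the example in the notes.
F = {"card reader": 1, "disk drive": 5, "tape drive": 7, "printer": 9}


class OrderedRequester:
    """Tracks the resource types held by one process and rejects any
    request that would violate the increasing-order rule."""

    def __init__(self, ordering):
        self.ordering = ordering
        self.held = set()

    def request(self, resource):
        rank = self.ordering[resource]
        # Highest rank currently held; 0 is below every value of F.
        highest = max((self.ordering[r] for r in self.held), default=0)
        if rank <= highest:
            raise ValueError(
                f"{resource} (F={rank}) requested while holding a resource "
                f"with F={highest}")
        self.held.add(resource)

    def release(self, resource):
        self.held.discard(resource)


p = OrderedRequester(F)
p.request("tape drive")      # F = 7
p.request("printer")         # F = 9 > 7, allowed
try:
    p.request("disk drive")  # F = 5 < 9, rejected
except ValueError as err:
    print("rejected:", err)
```

Because every process climbs the same ordering, no process can wait for a resource ranked below one it already holds, which is what rules out a circular wait.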
Outline for Lecture 12
I. Deadlock detection
    A. Expedient systems
    B. Graph reduction
II. Special cases for detection
    A. Single-unit requests
    B. Reusable-resource systems
    C. Reusable-resource systems and ordered request lists
    D. Single instance of each resource type

Deadlock detection: We have seen that a cycle in a resource-allocation graph may mean that there is a deadlock.

However, if there is no cycle in the graph, there is no deadlock. How do we know this?

To help us, let us define:

A node z of a graph is a sink if there are no edges (z, b) for any node b.

To use resource-allocation graphs to show that a deadlock exists, we need one more property of the system state:

The system state is expedient if all processes requesting resources (that are not available) are blocked.

(In such a system, a new allocation of resources can take place only at the time of a request or the time of a release.)

The reachable set of a node z is the set of all nodes to which there is a path beginning with z. The reachable set of z may contain z if there is a cycle.

A knot K is a nonempty set of nodes with the property that, for each node z in K, the reachable set of z is exactly the knot K.

If a resource-allocation graph is expedient, a knot is a sufficient condition for deadlock.

To see this, note that each process node in a knot is directly or indirectly waiting for other process nodes in the knot to release resources.

Since all are waiting on each other, no process node in the knot can be granted its request.

In considering deadlock detection and avoidance, we often need to (conceptually) remove non-deadlocked processes from the system.

We do this by graph reduction. Informally, we can reduce a graph by removing a process ("reduce graph G by process pi") if we can show that process pi can continue execution.
Reduction by a process pi simulates
- the acquisition of all of pi's outstanding requests,
- the return of all units of reusable resources allocated to pi, and
- if pi is a producer of a consumable resource, the production of a "sufficient" number of units to satisfy all subsequent requests by consumers.

Then the new inventory aj of that consumable resource rj is represented by ∞ to indicate that all future requests for rj are grantable.

Obviously, if we can reduce a graph by all of its processes, there is no deadlock.

Formally, a resource-allocation graph may be reduced by a non-isolated node representing an unblocked process pi in the following manner:
- For each resource rj, delete all edges (pi, rj), and if rj is consumable, decrement aj by the number of deleted request edges.
- For each resource rj, delete all edges (rj, pi). If rj is reusable, then increment aj by the number of deleted edges. If rj is consumable, set aj := ∞.

Reducing a graph by one process can make it reducible by another process. If the graph can be reduced by all its processes, it is called completely reducible.

If a graph is completely reducible, then all its processes can become unblocked. Therefore:

If a graph is completely reducible, then it represents a system state that is not deadlocked.

Example: Reduce this resource-allocation graph.

[Figure: a resource-allocation graph with processes p1, p2 and resource types r1, r2, shown in successive stages of reduction]

Let us consider deadlock-detection algorithms for various kinds of systems.

Single-unit requests: Suppose a system allows a process to have at most 1 outstanding request for a single unit of some resource. Suppose that the system is expedient.

Then there is an efficient algorithm for deadlock detection.

Lemma: If the system is deadlocked, then there is a knot in the resource-allocation graph.

Proof: Assume that there is no knot. Then each process pi is either
- a sink (i.e., it is not blocked), or
- the start of a path (pi, rj, pk, ..., px, ry, pz) such that node pz is a sink.
(The sink must be a process because of expedience; if it were a resource type, then the resource could and would be granted to the process requesting it.)

Then pz is not blocked, so we can reduce the graph by pz.

Eventually it will finish using its resources, and thus increase the inventory of ry.

Then px will no longer be blocked, and we can reduce the graph by px. We can continue up the path in this way, until pi is unblocked.

Since we can reduce the graph by any process pi, there is no deadlock.

Therefore, in a single-unit request system,

    a knot exists ⇔ the system is deadlocked.

Hence, the following algorithm can be used to detect deadlock in such a system.

Algorithm: Is a state S of a resource-allocation graph a deadlocked state?

    L := [List of sinks in state S]
    for the next node N in L do
        for all F where (F, N) is an edge do
            if (F is not in L) then
                L := L || F    { Append node F to list L. }
    deadlock := (L ≠ {set of all nodes});

Note that the algorithm starts with a list of sinks, and on each iteration, adds the nodes that are on a path of length 1 away from sinks (or from nodes that can reach sinks).

Therefore, when the algorithm has no more nodes from list L to test, L will contain all nodes that are not in some knot.

If there are some nodes that are not in L, then these are the processes and resources participating in a deadlock.

A process p is not deadlocked iff it is a sink or on a path to a sink. Thus,

Algorithm: Is process p deadlocked?

    deadlocked := true {until known otherwise};
    L := [p];
    for the next node N in L while deadlocked do
        for all F where (N, F) is an edge do
            if F is a sink then
                deadlocked := false
            else if (F is not in L) then
                L := L || F;

The list L initially contains only p. The algorithm then travels on all paths from nodes of L to neighbor nodes F. Nodes stay in L after they have been examined, so that no node is added twice. If the neighbor is a sink, the algorithm terminates, reporting no deadlock.
If the entire list is processed and no sink is found, then p is part of a knot and is deadlocked.

Since there are n processes and m resources, the algorithms execute in O(mn) time.

The second algorithm runs more quickly than the first. Why?

Therefore, the second algorithm can be executed whenever p requests an unavailable resource unit.

However, it may still be worthwhile to execute the first algorithm occasionally.

Systems with only reusable resources: We will establish that a state is deadlock-free by reducing its resource-allocation graph.

Note that it does not matter what node(s) we reduce first. Either
- the graph is completely reducible, or
- the same irreducible subgraph will be left after the graph has been reduced as much as possible.

This follows from the fact that a reduction never decreases the available inventory of a resource. (This is true iff the graph contains only reusable resources.)

So if the graph is reducible by, say, pi first, and then pj, it could also be reduced by pj and then by pi.

Assume that the state S of the system is not deadlocked.
- Then no process in the system is deadlocked.
- Thus any sequence of reductions will leave all processes unblocked.
- Therefore the graph is completely reducible.

Earlier we showed that if a graph is completely reducible, the state is not deadlocked. Thus we have established

    State not deadlocked ⇔ Graph completely reducible.

This allows us to test for deadlock by checking whether the resource-allocation graph is reducible.

The algorithm is very straightforward.

Algorithm: Simple algorithm for deadlock detection in reusable systems.

    L := [List of non-isolated process nodes];
    finished := false;
    while ((L ≠ ∅) and not finished) do
    begin
        p := first process in L by which graph can be reduced;
        if p ≠ nil then
        begin
            Reduce graph by p;
            Remove p from L;
        end
        else finished := true;
    end;
    deadlock := (L ≠ ∅);

Assume there are n processes and m resource types in the system.
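The simple reduction algorithm can be written out directly. The sketch below is a minimal illustration, not the notes' own code: it replaces the graph by three dictionaries (`available`, `allocation`, `request`), and those names, along with the function name `detect_deadlock`, are assumptions. Isolated processes are not filtered out; they are simply reduced at once, which does not change the result.

```python
def detect_deadlock(available, allocation, request):
    """Return the set of deadlocked processes (empty if there is none).

    available[r]      units of reusable resource r that are free
    allocation[p][r]  units of r held by process p
    request[p][r]     units of r that process p is currently requesting
    """
    avail = dict(available)
    remaining = set(allocation) | set(request)   # the list L
    finished = False
    while remaining and not finished:
        # First process in L by which the graph can be reduced, if any.
        p = next(
            (q for q in sorted(remaining)
             if all(n <= avail.get(r, 0)
                    for r, n in request.get(q, {}).items())),
            None)
        if p is not None:
            # Reduce by p: it can finish and return everything it holds.
            for r, n in allocation.get(p, {}).items():
                avail[r] = avail.get(r, 0) + n
            remaining.remove(p)
        else:
            finished = True
    return remaining


# Hypothetical state: p1 holds r1 and wants r2, p2 holds r2 and wants r1.
state = dict(
    available={"r1": 0, "r2": 0},
    allocation={"p1": {"r1": 1}, "p2": {"r2": 1}},
    request={"p1": {"r2": 1}, "p2": {"r1": 1}},
)
print(sorted(detect_deadlock(**state)))  # ['p1', 'p2']
```

Each pass of the while loop searches the remaining processes again for one that can be reduced; that repeated search is the cost the questions that follow ask about.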
What is the time complexity of this algorithm?
- What is the largest number of resource nodes that can be involved in a reduction?
- How many iterations of the while loop are (may be) needed?
- Are there any other sources of complexity?

Therefore the complexity of this algorithm is ______

How can we make this algorithm more efficient? We need to avoid the search for a process by which the graph can be reduced.

Recall: A graph can be reduced by a process p iff the process can acquire all the resource units it is requesting.

For each resource, we keep a list (ordered_requests[r]) of the processes that need it.

On this list, the processes are ordered by increasing need for the resource.

Thus, whenever a resource is released, without searching we can tell which processes' needs can be satisfied.

For each process, we also keep track of the number of resources it still "needs", i.e., the number of resource types for which its request cannot be satisfied. This is called wait_count[p], for each process p.

Whenever the graph is reduced by a process, that process may release enough resources to satisfy process pi's request for other resource types. Whenever that happens, wait_count[pi] is decremented.

(Note: The order in which the graph is reduced represents one order that the processes could finish, not necessarily the order they will finish. Reduction does not involve predicting the future!)

The algorithm below uses a list list_to_be_reduced that contains processes that are known not to be deadlocked.

Algorithm: Algorithm for deadlock detection in reusable systems, with ordered request lists.
    L := [List of non-isolated process nodes];
    list_to_be_reduced := [List of non-isolated process nodes whose wait_count is 0];
    while list_to_be_reduced ≠ ∅ do
    begin
        Remove a process pi from list_to_be_reduced;
        for all rj ∈ {resources assigned to pi} do
        begin
            Increase available units of rj by the number of units assigned to pi;
            for all pk in ordered_requests[rj] whose request can be satisfied do
            begin
                wait_count[pk] := wait_count[pk] - 1;
                if wait_count[pk] = 0 then
                    Add pk to list_to_be_reduced;
            end;
        end;
        Remove pi from L;
    end;
    deadlock := (L ≠ ∅);

Since the original deadlock-detection algorithm had only one loop and this has three nested loops, how can this one be faster?

Example: Consider a system of three processes, p1, p2, and p3, and three reusable resources, r1, r2, and r3.

1. Initial state

                        p1          p2          p3
    holds:
    wait_count          0           0           0

                        r1          r2          r3
    available
    in use
    ordered_requests

2. p1 requests and acquires two units of r3.

                        p1          p2          p3
    holds:              r3 r3
    wait_count          0           0           0

                        r1          r2          r3
    available
    in use
    ordered_requests

The deadlock-detection algorithm is not applied, since all wait_counts are 0.

3. p3 requests and acquires two units of r2.

                        p1          p2          p3
    holds:              r3 r3                   r2 r2
    wait_count          0           0           0

                        r1          r2          r3
    available
    in use
    ordered_requests

4. p3 requests two units of r3.

                        p1          p2          p3
    holds:              r3 r3                   r2 r2
    wait_count          0           0           1

                        r1          r2          r3
    available
    in use
    ordered_requests                            {p3}

The deadlock-detection algorithm is applied with L = {p1, p3} and list_to_be_reduced = {p1}. The algorithm terminates in one pass with the conclusion that deadlock does not exist.

5. p2 requests and acquires two units of r1.

                        p1          p2          p3
    holds:              r3 r3       r1 r1       r2 r2
    wait_count          0           0           1

                        r1          r2          r3
    available
    in use
    ordered_requests                            {p3}

6. p2 requests two units of r2.
                        p1          p2          p3
    holds:              r3 r3       r1 r1       r2 r2
    wait_count          0           1           1

                        r1          r2          r3
    available
    in use
    ordered_requests                {p2}        {p3}

The deadlock-detection algorithm is applied with L = {p1, p2, p3} and list_to_be_reduced = {p1}. What does the algorithm do?

7. p1 requests a unit of r1.

                        p1          p2          p3
    holds:              r3 r3       r1 r1       r2 r2
    wait_count          1           1           1

                        r1          r2          r3
    available
    in use
    ordered_requests    {p1}        {p2}        {p3}

The deadlock-detection algorithm is applied with L = {p1, p2, p3} and list_to_be_reduced = ∅. What happens?

Since only unsatisfiable requests can cause deadlock, this algorithm only needs to be run at request time.

Consequently, it detects deadlock immediately (after the request that caused the deadlock).

Therefore, the detection algorithm needs to be run only until wait_count of the requesting process = 0.

Single instance of each resource type: The time complexity of the fast deadlock-detection algorithm is O(mn log n). Graphical methods can achieve O(n^2) if there is only one instance of each resource type.

Define a wait-for graph, which is a resource-allocation graph, modified as follows:
- Resource nodes are removed.
- Edges are "collapsed" so that an edge (pi, pk) exists if and only if the resource-allocation graph contained edges (pi, rj) and (rj, pk) for some resource rj.

[Figure: a resource-allocation graph with processes p1 to p5 and resources r1 to r5, and the corresponding wait-for graph on p1 to p5]

A deadlock exists iff the wait-for graph contains a cycle.

Cycles can be detected by an O(n^2) algorithm.

If deadlock does exist, it can be eliminated by selecting a single victim.

Deadlock avoidance (in reusable-resource systems): Requires some additional information on resource requests; for example, we can require each process to declare the maximum amount of each resource that it will ever need.

The concept of a safe state is important. Here, a resource-allocation state consists of the number of allocated and available resources, and the maximum claim (or maximum demands) of each process.
A state is safe if the system can allocate resources to each process (up to its maximum), and still avoid a deadlock.

Outline for Lecture 13
I. Deadlock avoidance
    A. Safe states
    B. Algorithm for deciding safety
    C. Banker's algorithm
    D. Single instance of each resource type
II. Deadlock recovery
    A. Combined approach to deadlock handling

Example: Suppose the system has three processes p1, p2, and p3, and twelve units of one resource (e.g., tape drives).

State 1:

    Process     Allocated   Maximum Need
    p1          1           4
    p2          4           6
    p3          5           8
    Available:  2

This state is safe, because requests of ______, ______, and ______ can be satisfied, in that order.

State 2:

    Process     Allocated   Maximum Need
    p1          8           10
    p2          2           5
    p3          1           3
    Available:  1

This state is unsafe: regardless of which process is next granted the available drive, we can't guarantee that all 3 processes will finish.

Notice that

    unsafe state ⇏ deadlock will occur.

It just implies that some unfortunate sequence of requests might lead unavoidably to deadlock.

"Unavoidably" means that we can't avoid deadlock by choosing to delay some processes' resource requests.

[Figure: Safe, unsafe, and deadlocked state spaces]

To avoid entering an unsafe state, we must carefully consider resource requests before granting them. For example, suppose the system is in State 1 (safe state).

Now suppose p3 requests and is granted an additional tape drive:

State 3:

    Process     Allocated   Maximum Need
    p1          1           4
    p2          4           6
    p3          6           8
    Available:  1

Why is this now an unsafe state?

Example: Assume a maximum-claim reusable-resource system with four processes and three resource types. The maximum-claim matrix is given by

          | 4   1   4 |
    Max = | 3   1   4 |
          | 4   7  12 |
          | 1   1   6 |

where Max_ij denotes the maximum claim of process i for resource j. The total number of units of each resource type is given by the vector (4 8 16).
The allocation of resources is given by the matrix

          r1   r2   r3
  A = p1   0    1    4
      p2   2    0    1
      p3   1    2    1
      p4   0    0    3

where A_ij denotes the number of units of resource j that are currently allocated to process i. Processes are numbered p1 through p4, and resources are numbered r1 through r3.

Suppose process 1 requests 1 more unit of resource 1. Then, after the allocation is made, the matrix A is

          r1   r2   r3
      p1   1    1    4
      p2   2    0    1
      p3   1    2    1
      p4   0    0    3

and the maximum-remaining demand matrix Max − A is

          r1   r2   r3
      p1   3    0    0
      p2   1    1    3
      p3   3    5   11
      p4   1    1    3

The Avail vector is [0 5 7].

The state is unsafe, because the system cannot guarantee to meet the maximum demands of any process. For example, if all of the processes request an extra unit of resource 1 before any process releases a unit of resource 1, all the processes will be deadlocked. They are not deadlocked now, because they might release their resources before demanding more.

Now, suppose instead that process 4 requests 3 more units of resource 3. The processes can finish in the order p4, p2, p3, p1. To see this, note that after the allocation is made, the matrix A is

          r1   r2   r3
      p1   0    1    4
      p2   2    0    1
      p3   1    2    1
      p4   0    0    6

and the maximum-remaining demand matrix Max − A is

          r1   r2   r3
      p1   4    0    0
      p2   1    1    3
      p3   3    5   11
      p4   1    1    0

The Avail vector is [1 5 4].
The remaining demand of p4 is ≤ Avail, so p4 may complete. Now Avail is [1 5 10].
The remaining demand of p2 is ≤ Avail, so p2 may complete. Now Avail is [3 5 11].
The remaining demand of p3 is ≤ Avail, so p3 may complete. Now Avail is [4 7 12].
The remaining demand of p1 is ≤ Avail, so p1 may complete.
Thus the state is safe.

Algorithm for deadlock avoidance (Banker's algorithm): The deadlock-avoidance algorithm is very similar to the deadlock-detection algorithm, but it uses the processes' maximum claims rather than their current allocations.

Let us introduce a new kind of edge, a claim edge, represented by a dashed line. Like a request edge, a claim edge points from a process to a resource type.
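The safety check carried out by hand in the four-process example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the notes' own algorithm statement: the function name safe_sequence and the list-of-rows layout (one row per process, one column per resource type) are assumptions made here. It repeatedly looks for a process whose remaining demand fits within what is available, lets it finish, and reclaims its allocation.

```python
def safe_sequence(max_claim, alloc, avail):
    """Return an order in which all processes can finish, or None if unsafe.

    max_claim, alloc: one row per process, one column per resource type.
    avail: currently available units of each resource type.
    (Hypothetical helper for illustration only.)
    """
    n = len(alloc)
    work = list(avail)
    # Remaining demand = maximum claim minus current allocation.
    need = [[m - a for m, a in zip(mrow, arow)]
            for mrow, arow in zip(max_claim, alloc)]
    finished = [False] * n
    order = []
    progress = True
    while progress:
        progress = False
        for i in range(n):
            if not finished[i] and all(x <= w for x, w in zip(need[i], work)):
                # Process i can run to completion and release what it holds.
                work = [w + a for w, a in zip(work, alloc[i])]
                finished[i] = True
                order.append(i)
                progress = True
    return order if all(finished) else None


# Data from the example (rows are p1..p4, columns r1..r3).
MAX = [[4, 1, 4], [3, 1, 4], [4, 7, 12], [1, 1, 6]]

# After p1 is granted one more unit of r1: Avail = [0 5 7].
ALLOC_P1 = [[1, 1, 4], [2, 0, 1], [1, 2, 1], [0, 0, 3]]
# After p4 is granted three more units of r3: Avail = [1 5 4].
ALLOC_P4 = [[0, 1, 4], [2, 0, 1], [1, 2, 1], [0, 0, 6]]

unsafe_result = safe_sequence(MAX, ALLOC_P1, [0, 5, 7])
safe_result = safe_sequence(MAX, ALLOC_P4, [1, 5, 4])
print(unsafe_result, safe_result)
```

The sketch scans processes in index order, so the completion order it reports may differ from the order p4, p2, p3, p1 used in the text; any complete order shows that the state is safe.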
[Figure: a graph on p1, p2, r1, and r2 showing claim edges as dashed lines.]

A claim edge indicates that a process pi may request the resource rj some time in the future. All claim edges for pi must be present before pi starts executing. Thus, a request edge (pi, rj) may be added only if a claim edge (pi, rj) is already present.

A resource-allocation graph with claim edges is called a maximum-claim graph. It reflects the projected worst-case future state in resource allocation.

A state is safe iff its corresponding maximum-claim graph is deadlock free.

Whenever a process makes a request, the following algorithm is executed:

Algorithm: Deadlock avoidance (Banker's algorithm).
1. Project the future state by changing the request edge to an assignment edge.
2. Construct the maximum-claim graph for this state and analyze it for deadlock. If deadlock would exist, defer the request. Otherwise grant the request.

Example: Suppose we are given the following system. The processes' maximum claims are represented as a claim matrix C, where c_ij is the maximum claim of process pi for units of type rj.

          r1   r2
  C = p1   1    2
      p2   2    2

[Figure: the resource-allocation graph for p1, p2, r1, and r2.]

Suppose p1 requests one unit of r2. Can this request be granted? To see, grant the request, then construct the maximum-claim graph and see if it contains a deadlock:

[Figure: the projected state after granting the request, and its maximum-claim graph.]

Suppose that process p2 then requests a unit of r1. Can this request now be granted?

[Figure: the projected state after granting p2's request, and its maximum-claim graph.]

Single instance of each resource type: The deadlock-avoidance algorithm can require O(mn^2) operations. If there is only one instance of each resource type, an O(mn) algorithm can be devised.

Algorithm: Deadlock avoidance with a single instance of each resource type.
• Assume we are given a maximum-claim graph.
• Then when a process makes a request, it can be granted only if converting the request edge (pi, rj) to an assignment edge (rj, pi) does not create a cycle in the resource-allocation graph.
(The detection of a cycle takes time O(mn).)

For example, in the diagram below, we cannot grant r2 to p2. Why?

[Figure: two graphs on p1, p2, r1, and r2, before and after the proposed assignment of r2 to p2.]

In general, deadlock avoidance is more expensive than deadlock detection:
• The algorithm must be executed for every request prior to granting it.
• It restricts resource utilization, so it may degrade system performance. (It assumes that the worst-case request sequence will come true.) This is especially severe if the claims are not precisely known.

Deadlock recovery: Several methods can be used.
• Abort one or more processes to break the circular-wait condition. Considerations: priority, used computing time, expected remaining time, etc.
• Preempt resources from one or more processes.
• Roll back a process. Sometimes also called checkpoint/restart. Snapshots of process state are saved at periodic intervals, usually specified by the process itself. To break a deadlock, the process is aborted and later resumed at the previous checkpoint. Rarely used except for very long processes. Also useful in dealing with system crashes.

In general, victims are selected for deadlock recovery on the basis of priority. The choice may also depend on how close a process is to finishing.

Combined approach to deadlock handling:
• Order resource classes (not resource types), e.g.,
    Internal resources (resources used by the operating system, including process-control blocks).
    Central memory.
    Peripherals (tape drives, printers, etc.).
    Swappable space (secondary storage required for a process).
• If a deadlock does occur, it must involve only one class of resources. The strategy for recovery depends on which class is involved.

For example, assume that the four resource classes are ordered as shown above. Then:
• Internal resources. Resource ordering can be used (run-time choices between pending requests are unnecessary).
• Central memory. Preemption can be used (by suspending the process).
• Peripherals. In a batch system, avoidance can be used, using information from control cards.
• Swappable space. Preallocation can be used, if maximum storage requirements are usually known.

Deadlocks, present and future: Deadlock is becoming a more important problem, because
• Distributed processing: Resources may be shared over large networks, so there is more possibility for deadlock.
• Processes will be interactive, so there is less possibility of the system (or user) knowing resource needs in advance.
• User-friendly interfaces will not want to bother users for maximum-usage information.
• Large databases bring the possibility of deadlock for portions of a database.

Outline for Lecture 14
  I. Memory management
     A. The name space
  II. Loaders
     A. Absolute vs. relocating
     B. Relocation bits
     C. Base registers
  III. Linkers
     A. Direct vs. indirect
     B. Two-pass direct linking
     C. An extended example

Memory management: Memory is a device for storing information. Its function is to return the value associated with a particular name used by the programmer.

The name space: The programmer refers to values by means of
• An unfixed collection of logical names.
• Symbolic rather than numeric.
• Have no inherent order.
• Lengths can vary.
• The same name can be used by other programmers.

The name space is the set of names used by the programmers who wrote the code for a single process.

The system must provide a mapping between the set of names used by the programmer and the set of values which the program uses. Each name can be mapped (at any given time) to a unique value.

Both the software and the hardware play a role in performing this mapping:
• The compiler (or assembler) translates the name to a logical address. The set of all logical addresses is known as the logical address space.
• The logical address is translated to a physical address.
  The set of all physical addresses is known as the ____.
  This translation may be performed by the linker, the loader, and/or the relocation hardware (and index registers) in the computer itself. This process is known as address translation.
• Given a physical address, the memory either retrieves or stores a value.

Our discussion of address mapping will focus on two aspects:
• The software that performs this mapping (linkers and loaders).
• The hardware that performs this mapping (we will consider several computer architectures).

Loaders: The mapping of names to values can take place at 4 times:
1. Compile time
2. Link time
3. Load time
4. Execution time

1. The compiler (or assembler) translates a program from a source language to a machine language.
2. The linker takes several separately compiled machine-language programs and combines them so that they may be executed as a unit.
3. The loader takes a machine-language program and places it in memory so that it can be executed.
4. At execution time, further translation may take place, if the loader has not produced absolute physical addresses.

(There is a good discussion of these in S&G, pp. 240 to 241, though it counts linking as part of the loading process.)

Sometimes, steps 2 and 3 are both performed by a linking loader; or steps 1, 2, and 3 are performed by a load-and-go compiler. Also, step 2 is not necessary if the compiled program is to be run without other programs.

The simplest kind of loader, an absolute loader, just takes a program from secondary storage, places it in memory, word by word, then transfers control to its first instruction.

What is the most obvious limitation of this kind of loader?

In general, the user does not know a priori where the program will reside in memory. A relocating loader is capable of loading a program to begin anywhere in memory:
• The addresses produced by the compiler run from 0 to L - 1.
• After the program has been loaded, the addresses must run from N to N + L - 1.

Therefore, the relocating loader adjusts, or relocates, each address in the program.

[Figure: the object program (logical addresses 0 to L - 1) is mapped into main memory (physical addresses N to N + L - 1).]

Fields that are relocated are called relative; those which are not relocated are called absolute. Which of these are relative?
• Opcodes
• Register numbers
• Direct addresses
• Shift amounts
• Immediate operands

Below is a sample program (Calingaert), together with its object code.

  Source program                   Before relocation          After relocation
  Label  Opcode  Opnd.1  Opnd.2    Loc.  Len.  Reloc.  Text   Loc.  Machine code
         COPY    ZERO    OLDER     00    3     011     13 33 35   40    13 73 75
         COPY    ONE     OLD       03    3     011     13 34 36   43    13 74 76
         READ    LIMIT             06    2     01      12 38      46    12 78
         WRITE   OLD               08    2     01      08 36      48    08 76
  FRONT  LOAD    OLDER             10    2     01      03 35      50    03 75
         ADD     OLD               12    2     01      02 36      52    02 76
         STORE   NEW               14    2     01      07 37      54    07 77
         SUB     LIMIT             16    2     01      06 38      56    06 78
         BRPOS   FINAL             18    2     01      01 30      58    01 70
         WRITE   NEW               20    2     01      08 37      60    08 77
         COPY    OLD     OLDER     22    3     011     13 36 35   62    13 76 75
         COPY    NEW     OLD       25    3     011     13 37 36   65    13 77 76
         BR      FRONT             28    2     01      00 10      68    00 50
  FINAL  WRITE   LIMIT             30    2     01      08 38      70    08 78
         STOP                      32    1     0       11         72    11
  ZERO   CONST   0                 33    1     0       00         73    00
  ONE    CONST   1                 34    1     0       01         74    01
  OLDER  SPACE
  OLD    SPACE
  NEW    SPACE
  LIMIT  SPACE

In that program, what address expressions need to be relocated? How about
  FRONT+7
  FRONT-FINAL
  FRONT+FINAL-LIMIT

Address expressions normally may include addition and subtraction, but not multiplication or division by other addresses.

Also, (the number of addresses added) - (the number of addresses subtracted) must = 0 or 1. Therefore, FRONT+FINAL-LIMIT is legal, but FRONT+FINAL+LIMIT and FRONT-FINAL-LIMIT are not.

Relocation bits: Associated with each address in the program (and sometimes with each opcode too!) is a relocation bit, telling whether it needs to be relocated.
  0: No.
  1: Yes.
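The way a relocating loader uses relocation bits can be sketched as follows. This is a minimal illustration, not the loader from the notes: the function name relocate and the (loc, relocation bits, words) record layout are assumptions chosen to mirror the Loc./Reloc./Text columns of the sample program, and the load origin N = 40 is the one implied by its "after relocation" column.

```python
def relocate(records, origin):
    """Load object-code records at `origin`, relocating the relative fields.

    records: list of (loc, reloc_bits, words), where reloc_bits is a string
             with one bit per word ('1' = relative, '0' = absolute).
    Returns a dict mapping physical address -> word.
    (Hypothetical record format, for illustration only.)
    """
    memory = {}
    for loc, reloc_bits, words in records:
        for offset, (bit, word) in enumerate(zip(reloc_bits, words)):
            # A relative field has the load origin added to it;
            # an absolute field (opcode, constant) is copied unchanged.
            memory[origin + loc + offset] = word + origin if bit == "1" else word
    return memory


# Three records taken from the sample program.
records = [
    (0, "011", [13, 33, 35]),   # COPY ZERO OLDER
    (28, "01", [0, 10]),        # BR FRONT
    (32, "0", [11]),            # STOP
]
memory = relocate(records, origin=40)
print(memory)
```

With origin 40, the COPY instruction's operands 33 and 35 become 73 and 75 while its opcode 13 is untouched, which matches the machine code listed for location 40 in the sample program.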
The relocation bits may either
• immediately follow each instruction, or
• be collected together into a single contiguous relocation map which follows the text of the object code, or
• a compromise: divide the object code into fixed-length chunks and group as many relocation bits with each chunk as there are potentially relative fields in a chunk of code.

What are the advantages and disadvantages of each of these methods?
• If relocation bits (and length fields) are interleaved with the program text, the text cannot be read directly into the storage locations it will occupy. The text needs to be handled in small units of variable length.
• If the relocation bits are collected in a relocation map, the text and relocation bits can be moved as a block. But a lot of extra memory may be needed to hold the map.
• With fixed-length chunks, one chunk plus its relocation bits can be moved as a block. Then the relocation bits are used to relocate the current chunk. The relocation bits may then be overwritten by the next chunk.

Base registers: Some computers have base-register addressing; that is, they implicitly add the contents of a base register B to each address:

  Physical address = logical address + contents of B.

Then, the program normally incorporates instructions to load the base registers with a procedure origin when the procedure is entered.

The loader rarely needs to relocate an address, since relocation is performed at execution time. Base-register addressing is sometimes called dynamic relocation, as opposed to the static relocation performed by a loader.

If more than one base register is used, the program can address more than one region of memory:

[Figure: base registers holding 4096, 12000, and 28672 point to separate regions of a memory that runs from 0 to 32K.]

One advantage of using base registers: The program can be run if there is enough free memory to hold it, even if that memory is not all in one large block.
What (few) addresses need to be relocated by the linker or loader when base-register addressing is used?

In the program below, note that a base register will be loaded with the value 15, which is the start address of the program.

  Addr  Label  Opcode  Operand
  15           ADD     ONE
  28    FINAL  LOAD    OPTR
  53    ONE    CONST   1
  54    OLD    SPACE
  55    OPTR   CONST   A(OLD)
  56    NEW    SPACE
  57    LIMIT  SPACE

Only address constants, such as A(OLD), need to be relocated! Since there are not very many of them, relocation bits are not used. (However, address constants need to be relocated only in architectures that do not automatically relocate all addresses relative to some base, as we shall see.)

Instead, the assembler appends a relocation list, consisting of one entry for each address constant.

Linkers: If a program has been compiled in more than one part, the parts must be linked together before they can be executed.

The problem: Suppose two parts of a program have been compiled separately.
• Part I uses logical addresses 0, ..., x1, and
• Part II uses logical addresses 0, ..., x2.
There must be some way to keep them from overwriting each other.

Linking of separately compiled procedures or modules can be accomplished in at least two different ways.

Direct linking. A piece of software known as a linker fills in the addresses of externally referenced symbols. (It fills them in in the code for each procedure.)

The assembler or compiler has prepared
• A dictionary (or definition table), which lists each internally defined symbol (1 entry/symbol) that is used by other modules.
    (symbol name, relative address, relocatability mode)
• An external-reference list (or use table), which lists each internally used non-local symbol (1 entry/occurrence).
    (symbol name, relative address)

At compile time, translation of externally defined symbols must be incomplete. (They are temporarily assigned address zero and absolute mode.)

The action of a two-pass direct linker:
  Pass 1: Merge the definition tables.
  Modify the address of each relative symbol.
  Pass 2: External references are patched by means of the use table.

It is also possible to write a one-pass linker, if chains are kept pointing to each symbol as it is encountered. This is similar to techniques used in compiler symbol tables.

The linker and loader are often combined to form a linking loader, since both need to read all procedures and inspect the relocation bits.

To summarize,
• A linker prepares relocatable code with interspersed relocation information.
• A linking loader produces executable code with no insertions.

Example: Consider the two modules below. The first uses two external references (defined in assembly code via INTUSE) and defines four external symbols (via INTDEF). The second uses four external references and defines two symbols that will be referenced from outside.

First module:

  Source program                  Object code
  Label     Opcode  Operand       Addr  Word M   Word M
  PROG1     START   0
  PROG2     INTUSE
  TABLE     INTUSE
            INTDEF  TEST
            INTDEF  RESUMEPT
            INTDEF  HEAD
            ...
  TEST      BRPOS   PROG2         10    01 a     00 a
  RESUMEPT  LOAD    LIST          12    03 a     30 r
            ...                   14
            LOAD    TABLE+2       20    03 a     02 a
  HEAD      ...                   22
            STOP                  29    11 a
  LIST      CONST   37            30    37 a
            END                   31

  Use table                       Definition table
  Symbol  Addr  Sign              Symbol    Addr  Mode
  PROG2   11    +                 PROG1     00    r
  TABLE   21    +                 TEST      10    r
                                  RESUMEPT  12    r
                                  HEAD      22    r

Second module:

  Source program                          Object code
  Label     Opcode  Operand               Addr  Word M   Word M
  PROG2     START   0
            INTDEF  TABLE
  TEST      INTUSE
  RESUMEPT  INTUSE
  HEAD      INTUSE
            ...
            STORE   TABLE+HEAD-TEST       15    07 a     27 r
            ...                           17
            BR      RESUMEPT              25    00 a     00 a
  TABLE     SPACE                         27    XX a
            SPACE                         28    XX a
            SPACE                         29    XX a
  TWO       CONST   2                     30    02 a
  ADDRTEST  CONST   A(TEST)               31    00 a
            END                           32

  Use table                       Definition table
  Symbol    Addr  Sign            Symbol  Addr  Mode
  HEAD      16    +               PROG2   00    r
  TEST      16    -               TABLE   27    r
  RESUMEPT  26    +
  TEST      31    +

A global symbol table is prepared by pass 1 (below). Then external references are resolved during pass 2.
Second segment after Pass 1 of linker:

  Source program                          Object code
  Label     Opcode  Operand               Addr  Word M   Word M
  PROG2     START   0
            INTDEF  TABLE
  TEST      INTUSE
  RESUMEPT  INTUSE
  HEAD      INTUSE
            ...
            STORE   TABLE+HEAD-TEST       46    07 a     58 r
            ...                           48
            BR      RESUMEPT              56    00 a     00 a
  TABLE     SPACE                         58    XX a
            SPACE                         59    XX a
            SPACE                         60    XX a
  TWO       CONST   2                     61    02 a
  ADDRTEST  CONST   A(TEST)               62    00 a
            END

  Use table                       Global symbol table
  Symbol    Addr  Sign            Symbol    Addr  Mode
  HEAD      47    +               HEAD      22    r
  TEST      47    -               PROG1     00    r
  RESUMEPT  57    +               PROG2     31    r
  TEST      62    +               RESUMEPT  12    r
                                  TABLE     58    r
                                  TEST      10    r

Program after Pass 2 of linker:

  Source program                          Object code
  Label     Opcode  Operand               Addr  Word M   Word M
  PROG1     START   0
  PROG2     INTUSE
  TABLE     INTUSE
            INTDEF  TEST
            INTDEF  RESUMEPT
            INTDEF  HEAD
            ...
  TEST      BRPOS   PROG2                 10    01 a     31 r
  RESUMEPT  LOAD    LIST                  12    03 a     30 r
            ...                           14
            LOAD    TABLE+2               20    03 a     60 r
  HEAD      ...                           22
            STOP                          29    11 a
  LIST      CONST   37                    30    37 a
  PROG2     START   0
            INTDEF  TABLE
  TEST      INTUSE
  RESUMEPT  INTUSE
  HEAD      INTUSE
            ...
            STORE   TABLE+HEAD-TEST       46    07 a     70 r
            ...                           48
            BR      RESUMEPT              56    00 a     12 r
  TABLE     SPACE                         58    XX a
            SPACE                         59    XX a
            SPACE                         60    XX a
  TWO       CONST   2                     61    02 a
  ADDRTEST  CONST   A(TEST)               62    10 r
            END

Indirect linking. Essentially the same, except that pointers into the external-reference list, rather than addresses, are placed in the code.

The linker places absolute addresses into the external-reference list. External references proceed indirectly, through the external-reference list.

Disadvantages:
• References are slower, but perhaps this is not too serious if only addresses of called procedures need to be referenced in this way.
• The external-reference lists must be kept around at execution time.

Advantage: Linking is much faster, since only the external-reference lists need to be modified, not the code. (But the advantage wanes if a linking loader is used.)

Dynamic linker. It is often preferable to delay linking certain modules or procedures until execution time (e.g., error-handling routines). Usually used with segmented virtual memory. We will describe dynamic linking after we describe segmented memory.
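The two passes of the direct linker described above can be sketched as follows, using the PROG1/PROG2 example. This is a simplified illustration under stated assumptions, not the notes' linker: the function name link and the dictionary layout of a module are hypothetical, and each module's text lists only the operand words that are relocated or patched (at the addresses implied by the use tables), not the full object code.

```python
def link(modules):
    """Two-pass direct linking of modules loaded one after another from 0.

    Each module is a dict with (hypothetical layout):
      length: number of words in the module
      text:   {relative address: (word, mode)}, mode 'r' or 'a'
      defs:   {symbol: relative address}          (definition table)
      uses:   [(symbol, relative address, sign)]  (use table), sign is +1 or -1
    Returns (global symbol table, linked image).
    """
    symbols, image, patches = {}, {}, []
    origin = 0
    # Pass 1: merge the definition tables and relocate relative words.
    for mod in modules:
        for sym, addr in mod["defs"].items():
            symbols[sym] = origin + addr
        for addr, (word, mode) in mod["text"].items():
            image[origin + addr] = word + origin if mode == "r" else word
        for sym, addr, sign in mod["uses"]:
            patches.append((sym, origin + addr, sign))
        origin += mod["length"]
    # Pass 2: patch each external reference by means of the use table.
    for sym, addr, sign in patches:
        image[addr] += sign * symbols[sym]
    return symbols, image


PROG1 = {
    "length": 31,
    "text": {11: (0, "a"), 13: (30, "r"), 21: (2, "a")},
    "defs": {"PROG1": 0, "TEST": 10, "RESUMEPT": 12, "HEAD": 22},
    "uses": [("PROG2", 11, +1), ("TABLE", 21, +1)],
}
PROG2 = {
    "length": 32,
    "text": {16: (27, "r"), 26: (0, "a"), 31: (0, "a")},
    "defs": {"PROG2": 0, "TABLE": 27},
    "uses": [("HEAD", 16, +1), ("TEST", 16, -1),
             ("RESUMEPT", 26, +1), ("TEST", 31, +1)],
}

symbols, image = link([PROG1, PROG2])
print(symbols)
print(image)
```

Because PROG1 is 31 words long, PROG2 is placed at origin 31, so TABLE ends up at 58 and the operand of STORE TABLE+HEAD-TEST becomes 58 + 22 - 10 = 70, in agreement with the listing after pass 2.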
Outline for Lecture 15
  I. Architecture I: Unprotected address space
  II. Architecture II: Partitions within main memory
     Swapping
  III. Architecture III: Two segments/process
  IV. Architecture IV: Segmented virtual memory
     Addressing
     Address translation

Memory models and computer architectures: We will describe several architectures that have been developed to allow more than one program to share memory, and discuss the problems of memory management in each.

Architecture I: Basic von Neumann architecture. Represented most recently by 8-bit microcomputers.
• The address space is of fixed size; there is no address translation. (Logical address = physical address.) Hence, any process can write anywhere in physical memory.
• May have index registers, which implement the simplest kind of address modification. They are needed for addressing array elements. An indexed address is a pair (i, x), where i denotes an index register and x denotes the displacement. Then

    Physical address = (contents of index register i) + x

Problems not solved by the basic von Neumann architecture:
• Size of operating system. When the size of the OS changes, all programs may have to be recompiled.
• Size of address space. Programs cannot be larger than physical memory, since there is no way to generate an address larger than the maximum physical address.
• Protection. A program can write anywhere in physical memory.
• Sharing of code. Two users are running the same code simultaneously. How are they to be kept from overwriting each other's data?

The first of these problems can be solved by partitioning main memory into several regions.

Architecture II: Partitions within main memory. Each program runs within a partition of memory. In the simplest case, there are just two partitions, one containing the OS, and the other containing a single user program. Usually the OS is at the low end of memory, because that is where the interrupt vector is.
In a more general case, there can be more than one user program. Each program has associated with it a single base-length register pair.

The example below shows three programs occupying main memory.

[Figure: three logical address spaces are mapped by the address-mapping function into one physical address space. Process 1 (logical 0 to 15000) occupies physical 40000 to 55000; Process 3 (logical 0 to 35000) occupies 70000 to 105000; Process 2 (logical 0 to 30000) occupies 105000 to 135000.]

Typically, the contents of the base register are added to each logical address. (Note: The base register is loaded by the operating system.)

  Q = base register
  R = length register

  Physical address = contents of Q + contents of index register + x

If the logical address is greater than the contents of the length register, it is an error, and the requested memory access is not performed:

  if contents of index register + x > contents of R then error.

Hence this architecture provides protection between processes.

From now on, we will forget about index registers in giving memory-mapping functions; they always take part in the computation of a logical address, but for simplicity, we omit the details.

Architecture II still has these limitations:
• It is still impossible to share programs or data.
• It is still impossible to protect one module of a program from another.
• Though many programs can fit in physical memory at once, no single program can be larger than physical memory.

Swapping (S&G, 8.3): In a multitasking system, several processes may be resident simultaneously in main memory.
• The processor takes turns working on each one.
• But some processes may be idle for a long time (if, e.g., the user walks away from his terminal).

Such processes waste space in memory that could better be used by other processes. So the solution is to swap out the idle processes to secondary storage.

Swapping of processes is very similar to swapping of segments in virtual memory, and we will treat it in more detail later.
But three special aspects of process swapping should be noted.
• Can a process be swapped out while waiting for input/output?
• Can swapping be done if the program has been relocated by a loader instead of by base-length registers?
• Suppose a process is using only part of the physical memory that is allocated to it. How can swapping be made more efficient?

Nowadays, process swapping is used mainly on PCs.
• In Microsoft Windows, when a new process is loaded and there is not enough memory, an old process is swapped to disk. The user decides when to bring it back.
• Windows/NT, however, supports paging.

Physical memory can be managed like a single large array, using any appropriate memory-management strategy (a few of which we will consider later).

Architecture III: Two segments per process. Same as Architecture II, except for a hack to allow sharing: Each process is given two base-length register pairs, (Q0, R0) and (Q1, R1).
• The first pair points to a non-sharable partition into which the program may write.
• The second pair points to a sharable partition that contains code and (sometimes) read-only data.

[Figure: (Q0, R0) delimits the data partition; (Q1, R1) delimits the code partition.]

• A bit in the address field of each operand can select which partition is to be used for each reference.

Still, programs can't be larger than main memory.

Example: The DEC-10 kept code and read-only data in high memory. If the leading bit of an address was 1, the contents of Q1 (in the diagram above) were added to the logical address to find the physical address. If the leading bit of an address was 0, it referred to non-sharable low memory, where read/write data is kept.

Example: The Univac 1108 had a similar scheme, but did not allow read-only data to be kept with code. Instruction references referred automatically to memory in the sharable region; data references referred automatically to memory in the non-sharable region.
Thus the Univac 1108 did not need an extra bit in each address.

Architecture IV: Segmented virtual memory. To run programs which are larger than physical memory, there must be some way of arranging to have only part of a program in memory at one time.

Any operating system which allows a program to be larger than physical memory is called a virtual-memory system, because it allows a program to use primary memory which is not physically present.

To achieve virtual memory, we generalize Architecture III to allow more than one sharable and/or non-sharable segment.
• It is possible for a process to share portions of its code (e.g., I/O routines) and its data (e.g., databases) without sharing all code or all data.
• The segments can now be swapped independently; not all need to be in memory at once.
• Since swapping of segments is now independent of swapping of programs, the question of when to swap something or other becomes more complicated.
• With more entities being swapped, memory fragmentation becomes more serious.

Swapping rules:
• Read-only segments need not be copied out to secondary storage.
• Segments which have not been written into since being swapped in need not be swapped out either.
• Uninitialized segments need not be swapped in.

Addressing in segmented memory: A logical address is of the form (s, d), where s is the segment number and d is the displacement.

The number of segments may be large; hence there are no segment registers, but rather a segment table.

The segment table contains segment descriptors, each of which holds
• a base address (physical address of first word or byte),
• a length field (giving length in bytes or words), and
• an access-rights field (read, write, and perhaps execute).

Here is a diagram of addressing using a segment table.
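Alongside the diagram, the segment-table lookup can also be sketched in code. This is a minimal illustration under stated assumptions: the names SegmentDescriptor, AddressingError, and translate, and the sample base addresses, lengths, and rights strings, are all hypothetical and chosen only to show the three descriptor fields (base, length, access rights) in use.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int     # physical address of the segment's first word
    length: int   # segment length in words
    rights: str   # allowed accesses, a subset of "rwx"


class AddressingError(Exception):
    """Raised when a reference is out of bounds or not permitted."""


def translate(segment_table, s, d, access):
    """Map the logical address (s, d) to a physical address.

    access is one of 'r', 'w', 'x'. (Hypothetical interface.)
    """
    if not 0 <= s < len(segment_table):
        raise AddressingError(f"no segment {s}")
    desc = segment_table[s]
    if not 0 <= d < desc.length:
        raise AddressingError(f"displacement {d} outside segment {s}")
    if access not in desc.rights:
        raise AddressingError(f"'{access}' access to segment {s} not permitted")
    return desc.base + d


# Illustrative table: a code segment and a data segment.
table = [
    SegmentDescriptor(base=3000, length=1000, rights="rx"),
    SegmentDescriptor(base=7000, length=500, rights="rw"),
]
print(translate(table, 0, 125, "x"))   # execute a word in segment 0
print(translate(table, 1, 10, "w"))    # write a word in segment 1
```

The sketch consults the table on every reference; it does not model the associative memory that hardware normally uses to avoid that cost.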
[Figure: addressing using a segment table. The segment-table base register b is added to the segment number s of the logical address (s, d) to select a descriptor (base, length, rights) in the segment table for the current process; d is compared with the length, and the base is added to d to form the physical address.]

Address translation: The naive method would be to consult the segment table each time a memory reference is to be performed.

Instead, some form of content-addressable, or associative, memory is usually used.
• This memory is smaller than a segment table.
• The input is compared against all keys simultaneously. If the input matches a key, then the corresponding value is output.

[Figure: an associative memory with keys k1, k2, ..., kn and values v1, v2, ..., vn; the input is matched against the keys and the matching value is output.]

Usually this memory is called a translation lookaside buffer, or TLB, because the processor "looks aside" to it as it is translating an address.

The diagram below shows address translation using a TLB.

Each time the TLB is accessed, this procedure is followed:
• If an entry matching the segment number is found, the length is checked, and the base address is returned.
• If no matching entry is found, a victim entry is chosen from among the TLB slots. The victim is chosen to be an entry that is not very active. We will discuss what this means later.

A TLB's hit ratio is the number of entries found / the number of times searched.

[Figure: address translation using a TLB. The segment number s of the logical address (s, d) is tried first in the translation-lookaside buffer, which holds descriptors (s, b, l, r) for the most active segments only. On a match, the base b is added to d to form the physical address; only if there is no match in the TLB is the segment table for the current process consulted, via the segment-table base register.]

The TLB does not have to be very large to produce a good hit ratio. Early studies showed
• Burroughs B6700: 98% with 8 cells.
• Multics: 98.75% with 16 cells.

However, this was 20 years ago, when programs were much smaller. Today's TLBs typically have from 128 to 2048 entries.

Let a be the time to search the TLB, m be the time to access main memory, and h be the hit ratio.
Then the effective access time is

  h(a + m) + (1 - h)(a + 2m) = a + (2 - h)m

(This formula applies only to a one-level segment table.)

Segment faults: When a segment is referenced, it may not be in main memory. Each segment descriptor contains a presence bit, which tells whether or not the segment is "present." (For simplicity's sake, the presence bit was not shown in the diagrams above.)

If a segment is not in main memory when it is referenced, it must be swapped in.
• This is called a segment fault.
• Usually, it will necessitate swapping some other segment(s) out.
• Thus the segment descriptor must also contain a secondary-storage address.

How many fields does a segment descriptor contain altogether?

Outline for Lecture 16
  I. Sharing segments
     The problem
     Separate segment tables
     Linkage sections
  II. Dynamic loading
  III. Dynamic linking
     Advantages of
     Extra info. in each segment
     Refs. via linkage secs.
     How to link a segment in
     Why use linkage sections?

Sharing of segments (similar to S&G, 8.5.5): If several processes are using the same code, it is advantageous for all processes to use just one copy.

In a time-sharing system, programs like editors are frequently shared. S&G (p. 271) gives an example of three processes sharing a text editor. Memory is laid out as shown below.

Note that the processes share code, but each has its own data.

What access rights should each of the pages have?

[Figure: three processes sharing an editor. Each process's virtual memory contains the editor, the I/O library, the file-system library, and its own data. The segment tables are:

  Segment               P1      P2      P3
  editor                3000    3000    3000
  I/O library           4000    4000    4000
  file-system library   6000    6000    6000
  data                  1000    7000    2000

In physical memory, data 1 is at 1000, data 3 at 2000, the editor at 3000, the I/O library at 4000, the file-system library at 6000, and data 2 at 7000.]

However, sharing is harder than it looks! Shared references must map to the same virtual address in all processes, because
• Code segments contain references to other instructions, e.g., CALL instructions. An instruction might be

    CALL 4, 125

  meaning call the procedure at segment 4, byte 125. This means that the procedure must be in segment ____.
• Code segments also contain references to data, like

    LOAD R1, (2, 321)

  meaning load R1 from byte 321 of segment 2. Again, the data segment must be segment ____.

The segment-table organization shown above won't suffice for sharing in general. Why not?

There are several ways of solving this problem. For now, we will mention two.

• Give a process two separate segment tables: a user segment table, whose segments are not sharable, and where any segment can have any segment number, and a system segment table, whose segments are shared.

  [Figure: a system segment table (shared) with entries 3000, 4000, 6000; a user segment table for P1 (private) with entry 1000; a user segment table for P2 (private) with entry 7000.]

  One bit in each address tells which segment table to use.

• Use indirection. Instead of containing segment numbers, the code just contains pointers to segment numbers. These pointers point to linkage sections, which contain the true segment numbers.

  [Figure: linkage sections for P1 (segment numbers 1, 2, 3, 5), P2 (1, 2, 3, 4), and P3 (1, 2, 3, 6), pointing into a single segment table whose entries 1 to 6 are 3000, 4000, 6000, 7000, 1000, 2000.]

  This is the approach taken by Multics dynamic linking, which we consider below.

Dynamic loading (S&G 8.1.2): With segmented memory, a process can begin execution before all the modules or procedures it may call are loaded.
• When a routine needs to call another routine, the calling routine checks to see whether the callee has been loaded.
• If not, the loader is called to load the desired routine into memory and update the program's address tables to reflect this change.
Then control is passed to the newly loaded routine.

Dynamic linking (S&G 8.1.3): If dynamic loading is used, linking must still take place in advance. This means that copies of code libraries are linked in with the object code of different applications in advance.

With dynamic linking, however, incorporation of library code into the executable image is deferred until the program is loaded.

What is the advantage of this?

The dynamic-linking strategy described below was used on the Multics system in the late '60s and '70s. It is of renewed interest with the coming of the Java class loader.

Each module or procedure is found in some segment. External references to code or data are handled differently from other systems.

External references are compiled as indirect references, just as in indirect linking. But unlike in indirect linking, the address of the external reference is not stored in the external-reference list itself. (The reason for this is to allow sharing, as we will see later.)

To expedite dynamic linking, extra "baggage" is attached to each segment.

- The dictionary and external-reference list for each segment are placed at the beginning of that segment.
- Each segment also contains a template for linkage sections, which are described in greater detail below.
- In each entry in the external-reference list, the compiler places the symbolic name* of the referenced symbol, the symbolic name of the segment in which it is found, and the displacement from the symbol that is to be referenced.

For example, in a reference to

    seg|[mysym]+5

seg is the name of the segment, mysym is the name of the symbol, and 5 is the displacement. The reference is to five words past the symbol mysym in segment seg.

[Figure: layout of a segment (dictionary, external-reference list, code and/or data, template for linkage section) and of an external-reference-list entry (segment name, symbol name, displacement), here seg, mysym, 5.]
*Actually, pointers to the symbolic names are used to save space.

There is a separate linkage section for each (process, code segment) combination. [The linkage section is created by copying the linkage-section template when the process first uses the segment.]

Consider the linkage section for (process p, code segment b).

- The linkage section contains one entry corresponding to each entry in the external-reference list (the external-reference list of the code segment b in which process p is now running).
- External references are made indirectly through the linkage section (similar to indirect linking). (Multics has an indirect addressing mode.)
  - The code (in segment b) contains an index into the linkage section.
  - The linkage section contains the segment number of the referenced segment (say, segment c).
  - The segment table contains a descriptor for the segment (of course).

[Figure: segment b (dictionary, external-reference list, code for procedure A, linkage-section template); the linkage section for process p, procedure A, holding an entry (c, d); the segment table for process p; and the externally referenced segment c.]

How are segments linked in dynamically?

The first time an external reference is made to code segment c, a fault occurs. This fault is similar to a segment fault, except that the dynamic linker is invoked instead of the segment-fault handler.

The dynamic linker looks up the segment name in a file-system directory and fills in the segment-number field.

[Actually, the first time a process refers to a segment seg, an entry for seg is made in the process's Known Segment Table (KST). So if the process has ever used this segment before, the file system need not be consulted.]
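The linking fault can be pictured with a small simulation. This is only a sketch, not Multics code: the names LinkageEntry, Process, directory, and directory_searches are invented for illustration, segment numbers are assumed to be handed out in order of first use, and a single Python dict stands in for both the file-system directory and each segment's dictionary. The d field is computed as the dictionary offset of the symbol plus the displacement from the external-reference list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkageEntry:
    """One linkage-section entry, built from an external-reference-list entry."""
    seg_name: str                     # symbolic name of the referenced segment
    symbol: str                       # symbolic name of the referenced symbol
    displacement: int                 # displacement from that symbol
    seg_number: Optional[int] = None  # unset until the dynamic linker runs
    d: Optional[int] = None           # displacement within the segment, once linked


class Process:
    def __init__(self, directory):
        # Hypothetical file-system directory: segment name -> segment dictionary,
        # where a segment dictionary maps symbol name -> offset in the segment.
        self.directory = directory
        self.kst = {}                 # Known Segment Table: name -> segment number
        self.directory_searches = 0   # how often the file system was consulted

    def reference(self, entry):
        """Return (segment number, displacement), linking on first use."""
        if entry.seg_number is None:  # unlinked entry: the linkage fault
            self._dynamic_link(entry)
        return entry.seg_number, entry.d

    def _dynamic_link(self, entry):
        if entry.seg_name not in self.kst:   # segment not yet known to the process
            self.directory_searches += 1
            if entry.seg_name not in self.directory:
                raise LookupError("no such segment: " + entry.seg_name)
            # Assumption: the next unused segment number is assigned.
            self.kst[entry.seg_name] = len(self.kst)
        entry.seg_number = self.kst[entry.seg_name]
        # The symbol's offset comes from the dictionary stored in the segment.
        dictionary = self.directory[entry.seg_name]
        entry.d = dictionary[entry.symbol] + entry.displacement
```

With a directory of {"seg": {"mysym": 100}}, a reference to seg|[mysym]+5 resolves to segment 0, displacement 105, and a later reference to another symbol in seg is linked from the KST without a second directory search.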
[Figure: Known-Segment Table (KST), with columns Segment name and Segment #.]

The dynamic linker

- looks up the symbol name in the dictionary of the newly linked segment and
- calculates the displacement for the external reference (i.e., adds the displacement field from the external-reference table + the displacement given by the dictionary for this symbol).

The resulting value is the d portion of the linkage-segment entry.

The first reference to a segment probably causes a segment fault (if no other process was using that segment already).

Why are linkage sections necessary? To allow sharing!

Suppose two processes running different programs want to share a segment.

Suppose process P1 uses modules

- A, which has 5 segments, and
- B, which has 4 segments,

and process P2 uses modules

- D, which has 16 segments, and
- B.

Suppose also that P1 uses B before P2 does. What problem will arise?

[Figure: segment table for P1 and segment table for P2, with code within segment B; segment numbers 0, 4, 8, 12, 16, 20 are marked.]

How do linkage sections solve this problem?

For efficiency's sake, the linkage section can be bypassed by loading an external address into a processor register.

For example, if a process referred to an externally referenced segment which turned out to be segment 5, offset 17, it could load the pair (5, 17) into a processor register to access that segment without going through the linkage section each time.

Outline for Lecture 17
I. Arch. V: Paging
   Page-size considerations
   Paging vs. segmentation
II. Arch. VI: Multilevel page tables
III. Arch. VII: Paging + segmentation
   Address translation
   Descriptor registers

Architecture V: Paging.

- Pages are fixed-size segments.
- They are of length 2^k (bytes, words, etc.)
- In main memory, a page occupies a fixed-size page frame.

Page tables are similar to segment tables, except that no length field is needed in a page descriptor.

TLBs work as with segmented memory, except that no length field is needed in their entries.

Note that physical addresses are formed by concatenation, rather than by addition.

[Figure: the page-table base register b is added to the page number p of the logical address (p, d) to select an entry in the page table for the current process; the frame number f found there is concatenated with d to form the physical address (f, d).]

Page size (S&G, 9.8.2): This is an architectural question that is affected by programming considerations.

- On average, a program half-fills its last page. Thus, for a page size of n words, an average of n/2 words will be wasted in the last page. This is called internal fragmentation. As page size grows larger, internal fragmentation grows more serious.

- With a large page size, large amounts of possibly unneeded information are paged in with the needed information.

  Suppose a 20 × 1000 array is stored by rows: A[0, 0], A[0, 1], ..., A[0, 999], A[1, 0], ..., A[1, 999], ...

      Address   Column 0    Column 1    ...   Column 999
      0         A[0,0]      A[0,1]      ...   A[0,999]
      1000      A[1,0]      A[1,1]      ...   A[1,999]
      ...
      19000     A[19,0]     A[19,1]     ...   A[19,999]

  If a program performs calculations on the first fifteen columns, it will use 20 regions of fifteen words each, separated by 985 unused words.

  Assume page size is 2048 words. Since there are no gaps of 2048 words, all 10 (or 11) pages comprising the array will be brought into main memory.

  If page size is 128 words, then even if each region of fifteen referenced words crossed a page boundary, a maximum of 2 pages/row, or 40 pages, totaling 5120 words, would be brought in.

  However, often the number of words needed in main memory at one time depends on the organization of the array. See S&G, 9.8.4.

- As page size grows smaller, the page table grows larger. A larger and more expensive TLB is needed.

- Small pages make inefficient use of disks.
  It takes a long time for a disk to move its arm to the right cylinder, and then a much shorter time to transfer a page into main memory. We say that transfer time << seek time.

- In addition, larger pages decrease the number of page faults, and thus decrease system overhead.

Page sizes in several computer systems:

    Manufacturer   Model            Page size      Unit
    Honeywell      Multics          1024           36-bit words
    IBM            360/67           1024           32-bit words
    IBM            370/168          1024 or 512    32-bit words
    DEC            PDP 10/PDP 20    512            36-bit words
    DEC            Vax/11           512            8-bit bytes
    Intel          80386            4096           8-bit bytes
    Motorola       68030            256 to 32K     8-bit bytes
    Intel          Pentium          4K to 4M       8-bit bytes

In the 1970s, page size fell, due to reasons associated with locality of reference.

Since the 1980s, page size has been rising as increases in CPU speeds and memory capacity have been outpacing rises in disk speed.

Paging vs. Segmentation

Advantages of paging.

1. No external fragmentation of main memory. (Funny-size holes don't fit together well.)
2. No external fragmentation of secondary storage.
3. No bounds checking needed. This saves a small amount of time per reference. More importantly, the length information doesn't need to be stored in the TLB, so a TLB entry can be smaller.
4. No need to bring in a large segment just to reference a few words.

Advantages of segmentation.

1. No internal fragmentation of memory. In paged memory, a whole page must be allocated and transferred in even if only a few words are used.

   [Figure: a page divided into a used part and an unused part.]

2. Easier to handle data structures which are growing or shrinking (because maximum segment size is usually much larger than will ever be needed).

   In paged memory, if a process has many data structures that can grow or shrink (e.g., the process stack, another stack, a queue, etc.) the structures may run into each other:

   [Figure: a paged address space holding the code, the process stack, stack 2, and a queue; the queue's region is split into the part currently in use and the part allocated to the queue but not now in use.]

   In segmented virtual memory, each data structure can be allocated its own segment, which can grow (or shrink) if necessary.

3. Facilitates sharing. A segment that is shared need not be any larger than necessary. In paged memory, by contrast, each shared item must occupy a set of pages, even if it uses just one word of the last page.

4. Array bounds can be checked automatically. What is the advantage of this?

Architecture VI: Multilevel page tables.

In recent years, the virtual-memory requirements of processes have been growing rapidly.

The Intel Pentium Pro, for example, has a maximum physical-memory size of 2^36 bytes and a maximum virtual-memory space of 2^46 bytes.

Suppose that pages are 2^10 bytes long, but the virtual address space is 2^32 bytes. Suppose that a page-table entry occupies 8 bytes.

How large would the page table be?

Clearly, it is not possible to keep the entire page table in main memory.

We can instead arrange for the page table itself to be paged. Then a virtual address has the form

    (p1, p2, d)

where p1 is a page-table number, p2 is the page number within that page table, and d is the displacement.

[Figure: a virtual address is split into a page number (p1, p2) and a displacement d. p1 indexes the outer page table to select an inner page table; p2 indexes that inner page table to obtain a frame number, which is combined with the displacement.]

A two-level page table is appropriate for a 32-bit address space.

If the outer page table (see the diagram above) would be larger than a page, a third level can be used. In practice, page tables of up to four levels have been used.

- The SPARC architecture supports a three-level page table and a 32-bit address space.
- The M68030 supports a four-level paging scheme.

Architecture VII: Paging and segmentation combined.

Architecture VI has assumed that all of the (lowest-level) page tables are a full page long.
Suppose we relax this requirement to allow some of the tables to be shorter than a page.

Then we can call the set of pages pointed to by a lowest-level page table a segment.

- Each segment is divided into one or more pages.
- Avoids the storage-allocation problems posed by excessively large segments.

An address has the form (s, p, d), where

- s is the segment number,
- p is the number of a page within the segment, and
- d is the displacement within the page.

Each segment descriptor points to a page table. Each page descriptor in the page table points to a page.

The TLB usually contains entries of the form

    ((s, p), f)

where f is a page-frame number.

[Figure: translation of the address (segment 19, page 01, displacement 327). The segment-table base register holds 0500; 0500 + 19 = 0519, and the segment-table entry at 0519 holds 0700, the base of the page table. 0700 + 01 = 0701, and the page-table entry at 0701 holds frame number 35. Concatenating 35 with 327 gives physical address 35327, in the page that begins at 35000.]

- Thus, the TLB only needs to be searched once in case of a hit.
- In case of a miss, the segment table and then the page table must be consulted.

An example: OS/2 (S&G 8.7.2): OS/2 runs on Intel 386 and 486 architectures. The 386 uses paged segments:

- A process may have up to 16K segments.
- A segment may have up to 2^32 bytes.

The virtual address space of a process is divided into two partitions:

- 8K segments that are private to the process. Information about this partition is kept in the Local Descriptor Table (LDT).
- 8K segments that are shared among all processes. Information about this partition is kept in the Global Descriptor Table (GDT).

Each entry in the LDT and GDT consists of 8 bytes. It contains, e.g., base and length of the segment.

A logical address is a (selector, offset) pair. A selector has this format:

    | s (13 bits) | g (1 bit) | p (2 bits) |

where

- s designates the segment number,
- g indicates whether the segment is in the GDT or LDT, and
- p specifies protection.

How are segment descriptors accessed? The 80386 doesn't use a conventional TLB to map segments.
Instead, an address in a program refers to one of six segment registers. These registers are loaded by the program. Among them are

- CS, which contains the segment selector of the Code Segment
- DS, which contains the segment selector of the Data Segment
- SS, which contains the segment selector of the Stack Segment
- ES, which contains the segment selector of an Extra Segment

[Figure: the segment registers (loaded by the program) each hold a 16-bit segment selector, bits 15-0. Each segment-descriptor cache register holds 48 bits: access rights in bits 47-40, segment base address in bits 39-16, and segment size in bits 15-0.]

There are also four segment-descriptor cache registers, which are loaded by the system (see the figure).

A segment-descriptor cache register is loaded by this sequence of actions:

- The program places a segment-table index (a selector) in the appropriate segment register.
- The processor adds the selector value to the base address of the segment table, to select a segment descriptor.
- The processor copies the descriptor to the corresponding cache register.
- On accesses to the segment, the cache register, rather than the segment table, is consulted.

This scheme is not as fast as a TLB, because the cache registers must be reloaded every time one of the segments changes. (The code segment, for example, usually changes each time a procedure call occurs.)

Using the segment registers, the segment's base and limit are used to generate a linear address.

- The limit is used to check for address validity.
- If the address is not valid, an exception is generated.
- If it is valid, the base and offset are added, resulting in a 32-bit linear address.

The linear address is then translated into a physical address.

How large is the page table for each segment? How do we determine that?

Thus, page tables need to be paged. A linear address is divided into

- A page number consisting of 20 bits.
- A page offset consisting of 12 bits.
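The two steps just described (limit check and base + offset, then carving up the linear address) can be written out as a short sketch. The function names are invented for illustration, the limit is assumed to be the largest valid offset, and the 20-bit page number is further split into two 10-bit indices p1 and p2, as in the address format these notes give for the 80386.

```python
OFFSET_BITS = 12   # page offset (displacement) d
INDEX_BITS = 10    # each of the two page-table indices p1 and p2


class SegmentViolation(Exception):
    """Raised when an offset lies outside the segment."""


def to_linear(base, limit, offset):
    """Check the offset against the limit, then add base and offset."""
    # Assumption: offsets 0..limit inclusive are valid.
    if not 0 <= offset <= limit:
        raise SegmentViolation("offset %#x exceeds limit %#x" % (offset, limit))
    return (base + offset) & 0xFFFFFFFF   # keep the result to 32 bits


def split_linear(linear):
    """Split a 32-bit linear address into (p1, p2, d)."""
    d = linear & ((1 << OFFSET_BITS) - 1)         # low 12 bits
    page_number = linear >> OFFSET_BITS           # high 20 bits
    p2 = page_number & ((1 << INDEX_BITS) - 1)    # low 10 bits of the page number
    p1 = page_number >> INDEX_BITS                # high 10 bits of the page number
    return p1, p2, d
```

For instance, a segment with base 0x00400000 and an offset of 0x3025 yields linear address 0x00403025, which splits into p1 = 1, p2 = 3, d = 0x025.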
[Linear-address format: page number (p1, 10 bits; p2, 10 bits), displacement (d, 12 bits).]

Here is a diagram of 80386 address translation.

[Figure: the selector of a logical address (selector, offset) is placed in a segment register and selects a segment descriptor from the descriptor table (registers). The descriptor's base is added to the offset to give the linear address (directory, page, displacement). The directory field p1 selects a directory entry in the outer page table, the page field p2 selects a page-table entry in the inner page table, and the resulting page frame is combined with the displacement to give the physical address.]

Outline for Lecture 18
I. Capability addressing
   Sharing & protecting data
   Uniform-address solution
   Indirect-eval. solution
   Mult. segment tables
   Capability addressing
II. Protecting capabilities
   Partitioned memory
   Tagged memory
   Fenced segments
   Encryption
   The mapping table
III. Arch. IX: Single virtual addr. space
IV. Protection rings

The problem of shared references (S&G 19.6): How can processes share segments or pages? (R. S. Fabry, "Capability-Based Addressing," CACM, July '74, pp. 403-412)

Consider a process that calls a procedure:

[Figure: a process whose segment table has entries 0 (re, Main), 1 (re, Procedure), and 2 (rw, Data). The PC points into segment 0; Main contains "Call 1" and "Access 2".]

- The process needs to call the procedure in segment 1, and read and write the data in segment 2.
- Segment 2 contains private data, which must not be shared among different processes that execute the procedure.

Notice that the program counter (PC) points to somewhere in segment 0.

If two processes want to share the procedure, each must use a different set of data:

If process 1 and process 2 are running different programs, they will not necessarily use the same segment numbers for the procedure segment or the data segment.

For example, in the code at the right

- Process 1 knows the procedure as segment ____.
- Process 2 knows the procedure as segment ____.

How can the code be made to work for both processes? We will consider four different solutions.

[Figure: processes 1 and 2 sharing Main and the procedure, each with its own data. Segment table for process 1: 0 re, 1 re, 2 rw. Segment table for process 2: 0 re, 1 rw, 2 re. Main contains "Call ?" and "Access ?".]
Uniform-address solution: Each shared integer segment number must be interpreted in a functionally equivalent manner.

- Some segment numbers refer to private objects, and others to shared objects.
- Each shared segment number is dedicated to the same use in all processes that share it.
- Thus, functions of shared segment numbers must be defined centrally. Makes modularization difficult.
- This solution has been used by the Burroughs B6700 and the Medusa operating system for the Cm* multiprocessor.

[Figure: processes 1 and 2 each have a segment table with entries 0 (re), 1 (re), and 2 (rw). Both tables point to the shared Main, which contains "Call 1" and "Access 2", and to the shared procedure; entry 2 points to the data for process 1 or the data for process 2, respectively.]

Indirect-evaluation solution: Indirect linking (possibly dynamic linking) is used; processes make all references indirectly via linkage sections.

- When process 1 executes the code for FETCH 2, word 2 of the linkage section for process 1's main program is fetched. It contains 0, so process 1's segment 0 is used.
- A CALL "calls" the linkage section whose 0th entry points to the code.
- There is a linkage section for each separately compiled module used by a process.

[Figure: each process has a segment table, a linkage section for Main, a linkage section for Proc, a linkage-section base register, and a PC. Segment table for process 1: 0 rw, 1 re, 2 re, 3 r, 4 r. Segment table for process 2: 0 r, 1 r, 2 re, 3 re, 4 rw. Main contains "Call 1" and "Access 2".]

- A base register points to the current linkage section.
- The contents of this register must be changed at the time of (inter-module) procedure calls. For example, when process 1 executes CALL 1, the entry in word 1 of process 1's main program linkage section is fetched.
  It contains 4, which is then loaded into the linkage-section base register. Word 0 of the new linkage section is fetched. By convention, this word always points to the procedure code. It contains 2, which is then placed in the PC.

- Although linkage-section indices are used for all ordinary addresses, if a segment is to be passed as a parameter, a segment number must be passed instead (since the called procedure cannot consult the linkage section of the caller).

Disadvantages:

- Extra space needed for linkage sections.
- More segments, hence increased swapping overhead.
- Takes time to initialize linkage sections.
- Extra indirection in address translation.

Multiple segment-table solution: There is a different segment table for each procedure and each process.

Problem: Difficult to pass segment references as parameters. (What are you going to use for a segment number?)

Problem: Overhead of all those segment tables.

[Figure: each process has a base register, a PC, a segment table for Main, and a segment table for Proc, each with entries 0, 1, 2; the access codes shown are re, st, and rw. Main contains "Call ?" and "Access ?". The procedure is shared; each process has its own data.]

Architecture VIII: Capability-addressing solution.

We can do away with all those segment tables if we can, in effect, put segment descriptors in any ordinary data structure.

Such descriptors are called capabilities. They are protected pointers to objects.

Since they are protected from modification by users, they can be held in data structures or processor registers, or on the process stack.

So it is easy to pass capabilities as parameters.

[Figure: processes 1 and 2 each have a PC and registers 0, 1, 2 holding capabilities. A register holding an re capability points to Main, which contains "Access 0"; register 0 holds an rw capability for the process's own data. The procedure is shared.]

Means of protecting capabilities:

Partitioned memory. There are two kinds of segments (and registers), capability segments and data segments.
[Figure: a capability segment holding Capability 1, Capability 2, ..., Capability n, and a data segment holding Word 1, Word 2, ..., Word m.]

Disadvantage: Overhead of extra ____

Tagged memory. Each memory word contains a one-bit tag: 0 means data, 1 means capability.

[Figure: a data word carries tag 0; a capability carries tag 1.]

This is an attractive solution for a word-addressable machine.

Fenced segments. Each segment can hold both data and capabilities. It has a fence separating data in one portion from capabilities in the other.

[Figure: a segment divided into a data portion and a capabilities portion; the segment descriptor holds base, fence, and length.]

Encryption. When a capability is created by the kernel, it is encrypted by an algorithm that maps an unencrypted capability to a much longer encrypted version.

[Figure: an unencrypted capability mapped to a longer encrypted capability.]

Before a capability is used, hardware (or microcode) decrypts it. If the result is not a legal capability, an exception is raised.

Disadvantages: ____

Advantages: ____

The mapping table: How does a capability point to an object?

Here is one method:

- All of the segment tables are replaced by a single systemwide object table, which contains object descriptors. A capability contains a name field.

      | Name | Rights |

- This name field tells where in the object table to find the object descriptor (just like a segment number tells where to find a segment descriptor in a segment table).

Legend:

    i  register #
    d  displacement
    n  object name
    r  access rights
    b  base
    f  fence
    l  length
    t  abstract type
    p  presence

[Figure: the logical address (i, d) selects register i, which holds a capability (n, r). The name n selects the object-table entry (n, b, f, l, t, p). The displacement d is compared with the fence f, and the base b is added to d to give the physical address.]

This diagram shows an access to the data portion of an object. An access to the capability portion is similar, except that

- The base and fence are added to the displacement to yield the physical address.
- The displacement is compared with the length instead of the fence.
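The two kinds of access can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular machine: the class and function names are invented, rights are modeled as a set of letters, and the length field is taken to be the size of the capability portion (the notes say only that the displacement is compared with the length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: int            # n: where in the object table to find the descriptor
    rights: frozenset    # r: e.g. frozenset({"r", "w"})


@dataclass
class ObjectDescriptor:
    base: int            # b
    fence: int           # f: size of the data portion
    length: int          # l: assumed here to be the size of the capability portion
    present: bool = True # p


class ProtectionError(Exception):
    """Raised on a rights violation or an out-of-bounds displacement."""


def translate(cap, d, object_table, right, capability_part=False):
    """Map (capability, displacement d) to a physical address."""
    if right not in cap.rights:
        raise ProtectionError("capability lacks right " + right)
    desc = object_table[cap.name]
    if not desc.present:
        raise LookupError("object not present in main memory")
    if capability_part:
        if not 0 <= d < desc.length:      # compare with the length
            raise ProtectionError("displacement beyond capability portion")
        return desc.base + desc.fence + d # base and fence added to displacement
    if not 0 <= d < desc.fence:           # compare with the fence
        raise ProtectionError("displacement beyond data portion")
    return desc.base + d                  # base added to displacement
```

With an object whose descriptor has base 1000, fence 16, and length 4, a read of data word 3 maps to address 1003, and a read of capability 2 maps to address 1018.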
If the name field is large (50 bits or so), it will never be necessary to re-use object names.

However, the object table will have to be multilevel, like the multilevel page table of Architecture VI.

Sophisticated means are used to manage the object table (e.g., hashing of object names).

A TLB can be used to hold recently accessed object-table entries. (Just like in paged or segmented memory, except that association is on the object name instead of the page or segment number.)

An architecture that uses capabilities to control access can be structured according to the object model: each process has capabilities only for the procedures and data structures it needs to access.

However, if objects are very small and numerous, the object table can get to be very large (maybe 1/4 or 1/3 of memory!).

But it is possible to implement objects without a central object table.

Architecture IX: A single large virtual address space.

With paged and segmented memory, each process has its own virtual address space. Suppose we decree that there will be only one virtual address space in the entire system?

- Virtual address space must be very large (e.g., 2^64).
- Virtual address space must be allocated to processes (just like physical memory must).
- An object name is just its virtual address. (Of course, the object names must be protected by placing them in capability segments, tagging or encrypting them, etc.)

Two processes that share the object will use the same virtual address for it.

Memory is paged; several objects can fit on a page, or an object can span several pages.

[Figure: objects A, B, C, and D laid out across pages 1, 2, and 3; several objects may share a page, and an object may span pages.]

- Now an object table is not needed, just a page table.
- Objects can be much more numerous, and hence smaller.
- A multilevel page table will be required, with about 4 levels.
- In general, the pages used by a process may be scattered throughout the virtual address space.
  Hence, very few entries in each lowest-level page table may be in use.

  So, the page table is implemented as a hash table (since page numbers are sparse).

The object length (and perhaps the object type) is stored at the beginning of each object instead of in the object table.

Capabilities are one way to provide protection in hardware. Here is another one.

Protection Rings: A generalization of supervisor/user modes. Instead of only two states, user and supervisor (monitor), there are N rings, numbered 0 to N - 1.

- Processes running in ring 0 are the most privileged.
- Segments as well as processes have ring numbers.
- A process running in ring i may access segments belonging to rings i to N - 1 (providing it has a segment-table entry for them).
- A procedure may call code segments in more privileged rings only through well defined entry points known as gates.

  A call from a process in ring j > i must be directed to a gate (i.e., a displacement in the set {0, ..., k}).

  A call from a process in a ring whose number is j ≤ i may be directed anywhere within the code segment.

  [Figure: code segment s in ring i; its gates occupy displacements 0 through k.]

- If a more privileged procedure passes privileged data to a less-privileged procedure, it may only pass by value.

  For example, if a parameter is in a segment in ring 3, and it is passed to a procedure in ring 4, it may not be passed by reference (because that would allow the procedure in ring 4 to write into ring 3).

Advantages of ring structure (compared to supervisor/user mode):

- The entire OS need not reside ____
- Users may ____

Disadvantage of ring structure (compared to capability-based protection):

- Protection domains are ____

[Figure: privilege of ring 0, privilege of ring 1, ..., privilege of ring N - 1.]

Distributed shared memory: In Lecture 2, we saw that multiprocessors come in two varieties:

- Shared-memory multiprocessors can have more than one processor sharing a region of memory.
  Processes can therefore communicate via shared memory.

- Distributed-memory multiprocessors usually have to perform interprocess communication by passing messages from one processor to another.

[Figure: a shared-memory multiprocessor, with several processors (P) attached to one memory, and a distributed-memory multiprocessor, with each processor (P) paired with its own memory (M).]

Outline for Lecture 19
I. Distributed shared memory
   Shared- vs. distributed-memory machines
   Concept of DSM
   NUMA multiprocessors
   Page scanner
II. The Mach VM
   Sharing memory
   Copy-on-write
   Address mapping in Mach

Many techniques are known for programming shared-memory multiprocessors.

- Communication can be performed by simply writing to memory for the other processors to read.
- For synchronization, critical sections can be used with semaphores, monitors, etc. providing mutual exclusion.

A large body of literature is available on these techniques.

For distributed-memory multiprocessors, the situation is quite different.

- Communication must use message-passing, making I/O the central abstraction.
- Message-passing involves many complicating issues, e.g., flow control, lost messages, buffering, blocking.

  Remote procedure call (RPC; see Lecture 3) can be used in many cases, but
  - it can't be used to pass graphs and other data structures containing pointers;
  - it doesn't work when programs contain global variables, and
  - it makes it expensive to pass large arrays, since they must be passed by value.

Furthermore, shared-memory machines can use message-passing primitives when appropriate, but the reverse is not true.

Thus, shared-memory multiprocessors are much easier to program.

Unfortunately, distributed-memory machines are much easier to build.

- Single-board computer modules, containing processor, memory, and network interface, can be connected together in almost unlimited numbers. Multicomputers with thousands of modules are available from various manufacturers.
- It is impossible to build a shared-memory multiprocessor with more than two or three dozen processors, because the shared bus becomes a bottleneck.
Thus, in the last 10 years, a lot of attention has focused on implementing a shared address space on a distributed-memory machine.

Here is a simple description of how distributed shared memory (DSM) works.

1. Each page (or segment) is present on exactly one machine.
2. A processor can refer to local pages using the ordinary hardware, at full memory speed.
3. An attempt to reference a page located on a different module causes a hardware page fault, which is handled by the operating system. The operating system sends a message to the remote machine, which finds the needed page and sends it to the requesting processor.
4. The faulting instruction is restarted, and can now complete.

What is the main difference between this and ordinary virtual memory?

What would appear to be the major shortcoming of this approach?

Fortunately, it is not necessary to share the entire virtual memory; indeed, it is not even a good idea. Why?

Instead, advanced DSM systems share only particular objects within the address space.

NUMA multiprocessors: Any process in a DSM system can access memory belonging to any processor, but not equally fast. Access times are non-uniform, and so these machines are known as non-uniform memory access (NUMA) multiprocessors.

NUMA multiprocessors vary in how much hardware support is provided for DSM.

- The fastest NUMA multiprocessors, such as the SGI S2MP architecture and Origin multiprocessor, provide extensive hardware support for very fast reference to remote memory.
- However, it is also possible to implement DSM systems on networks of workstations (NOWs).

Regardless of the implementation, NUMA machines have these properties:

- It is possible for a processor to access memory anywhere in the system.
- It takes longer to access remote memory.

Thus, it is important to locate pages (or objects) close to the processes that are accessing them. How important depends on how fast remote access is.
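The four-step description of DSM given earlier in this lecture can be turned into a toy simulation, which also makes it easy to count how often remote references occur. This is a sketch only: the class name, the two-messages-per-fault accounting (one request, one reply carrying the page), and the policy of always moving the page to the faulting node are assumptions made for illustration.

```python
class DsmSimulator:
    """Toy model of simple DSM: every page lives on exactly one node."""

    def __init__(self, page_location):
        self.location = dict(page_location)  # page -> node currently holding it
        self.faults = 0                      # remote references (page faults)
        self.messages = 0                    # messages exchanged between nodes

    def reference(self, node, page):
        """Node references page; return True if a page fault occurred."""
        if self.location[page] == node:
            return False                     # step 2: local page, full speed
        # Step 3: page fault; the OS asks the holder for the page.
        self.faults += 1
        self.messages += 2                   # assumed: request plus reply
        self.location[page] = node           # the page now lives here
        # Step 4: the faulting instruction is restarted and completes.
        return True
```

If nodes 0 and 1 take turns referencing page 0, which starts on node 0, the sequence of references by nodes 0, 1, 0, 1 produces three faults and six messages.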
When a page fault occurs, the OS has a choice.

- If the page is read only, it may
  - map the local page to remote memory. This means that all references to the page must ____
  - replicate the page (make a copy on the local machine).
- If the page is read-write, it may
  - map the local page to remote memory, or
  - migrate the page to the faulting processor.

If the page is migrated to the faulting processor, what else must be done?

For either RO pages or RW pages, there is a tradeoff. The OS must guess if the page will be heavily used.

Of course, the OS cannot foretell the future.

- To correct mistakes in page placement, and
- to adapt to changes in reference patterns,

NUMA systems usually have a daemon process, called the page scanner, running in the background.

Periodically, the page scanner gathers usage statistics about local and remote references. (These statistics are maintained with help from the hardware, as we will discuss in Lectures 23 and 24.)

If usage statistics indicate that a page is in the wrong place, the page scanner unmaps the page so that the next reference causes a page fault. Then a new placement decision can be made.

If a page moves too often within a short interval, the page scanner can mark it as frozen, which causes it to stay put for a certain interval of time or until conditions change.

Possible scanner strategies:

- Invalidate any page for which there have been more remote references than local references.
- Invalidate a page iff the remote reference count has been > local reference count the last k times the scanner has run.

Studies have shown that no single algorithm works well for all programs.

The Mach virtual memory: Mach is a distributed version of Unix, developed at Carnegie Mellon University.

In Mach, each process (called a task) is assigned a single paged address space.

A page in the process's address space is either allocated or unallocated.
• An unallocated page cannot be addressed by the threads of a task. An allocated page can.
• An allocated page does not necessarily occupy main memory.

In Mach, memory is allocated and deallocated in terms of regions. An address is valid only if it falls into an allocated region.

Most memory-management hardware today supports an address space of at least 4 GB. Some applications benefit by using a large address space sparsely. An example is an application that maps several large files into its address space, and then ceases to use them.

Sharing memory: Mach provides several primitives for sharing memory within a task. But sometimes it is necessary to share memory with other tasks.

For example, a debugger needs to examine and modify the address space of a task being debugged. Mach provides primitives to perform these operations.

In Mach,

• All threads of a task automatically share all the memory objects that reside in the task's address space.
• Different tasks can share a page (or a memory object) by installing it into their virtual address spaces. This requires their page-table entries to cause addresses within the page to be translated to the same page frame.

In Mach, all sharing is by inheritance. When a task is created, a new address space is created for it. The address space can either

• be empty, or
• be based on an existing address space.

When a new address space is based on an existing address space, a page in the new address space is initialized based on the inheritance value of the corresponding page in the existing address space. An inheritance value of

• none means the child task does not inherit that page,
• copy means the child task receives a copy of the page, which it can manipulate without affecting the original page, and
• share means the same copy of the page is shared between child and parent.

Mach uses the principle of lazy evaluation to avoid unneeded work.
• Virtual-to-physical mapping of a page is postponed until the page is referenced.
• Page tables are not allocated until they are needed.
• When a page needs to be copied, it is not copied until it is actually written.

This kind of copying is called copy-on-write. How is it implemented?

When T1 and T2 want to share a page with copy inheritance, the system gives each of them read-only access to the page. Then when T1 writes into the page, it gets a new copy.

Advantages of copy-on-write:

• The Unix fork operation, for example, is implemented using copy-on-write.
• Pages that are never modified are thus never copied.

[Figure: copy-on-write. Task T1 starts with read/write access to page p1. After a virtual copy, tasks T1 and T2 both map p1. When T2 writes the page, T2 gets read/write access to a new page p2, while T1 keeps p1.]

Address mapping in Mach: Mach distinguishes machine-independent and machine-dependent aspects of address mapping.

• Page-table organization is machine dependent.
    The VAX supports separate system and user page tables.
    The Intel x86 architectures support multilevel page tables.
    The RS/6000 uses an inverted page table (explained later).
• Other aspects of address mapping are not machine dependent. For example, all architectures divide a process's address space into code, data, and stack areas. Also, in Mach, files are mapped into the address space. (This means they are assigned a set of addresses within the address space.)

An address map in Mach is a structure that tells what memory objects each part of the address space is associated with.

For example, a typical VAX Unix process has four entries in its address map when it is created:

• Code
• Stack
• Initialized data
• Uninitialized data

It indicates, for example, that the code occupies virtual addresses 0 to 150K.

[Figure: a typical address map, a linked list of entries with head and tail pointers.]

Virtual-memory objects: Each entry points to a memory object (the triangles in the diagram).
But more than one entry may point to a single object, as when different parts of the same object have different protection or copy attributes.

This is illustrated by the diagram of an address-map entry:

[Figure: an address-map entry, with the fields prev. entry, start, end, inheritance, protection, offset, object, and next entry.]

Address maps

• allow efficient implementation of the most common operations on the address space of a task, e.g., page-fault lookup, as when a file is accessed for the first time, or copy/protection operations on a memory region.
• efficiently maintain sparse address spaces.

When a new address space is created, and an object has the copy inheritance value, a new memory object is created for it. This new object is called a shadow object.

A shadow object need only contain those pages that have been copied. (For other pages, the original memory object is used.)

A shadow object may itself be shadowed as the result of a subsequent copy-on-write copy, creating a shadow chain.

[Figure: shadow chains. Successive copy operations give each address-map entry its own shadow object, chained back to the original memory object.]

This structure cannot, however, maintain shared memory. A change to one copy of a shared object needs to cause all other copies to change.

This requires an extra level of indirection when accessing a shared object. The address-map entry points to a share map, which in turn points to the memory object.

[Figure: a task-creation operation with inheritance "shared". The address-map entries of both tasks point to a common share map.]

Outline for Lecture 20
I. Memory consistency
II. Consistency models not requiring synch. operations
    Strict consistency
    Sequential consistency
    PRAM & processor consist.
III. Consistency models requiring synch. operations
    Weak consistency
    Release consistency

Memory consistency: In a distributed memory system, references to memory in remote processors do not take place immediately.

This raises a potential problem. Suppose that

• The value of a particular memory word in processor 2's local memory is 0.
• Then processor 1 writes the value 1 to that word of memory. Note that this is a remote write.
• Processor 2 then reads the word. But, being local, the read occurs quickly, and the value 0 is returned.

What's wrong with this?

This situation can be diagrammed like this (the horizontal axis represents time):

    P1:  W(x)1
    P2:          R(x)0

Depending upon how the program is written, it may or may not be able to tolerate a situation like this. But, in any case, the programmer must understand what can happen when memory is accessed in a DSM system.

Consistency models: A consistency model is essentially a contract between the software and the memory.

It says that if the software agrees to obey certain rules, the memory promises to store and retrieve the expected values.

Strict consistency: The most obvious consistency model is strict consistency.

Strict consistency: Any read to a memory location x returns the value stored by the most recent write operation to x.

This definition implicitly assumes the existence of a global clock, so that the determination of "most recent" is unambiguous.

Strict consistency is supported by uniprocessors, but it is generally impossible to implement in DSM systems.

In a DSM system, a read from a nonlocal memory location, or a write to a nonlocal memory location, requires sending a message.

This message requires a finite time to travel to and from the node where the memory is located. During that time, some other processor might change the value in the memory location.

Regardless of how efficient a message system is, it cannot cause messages to travel faster than the speed of light. If the difference in the two access times is, say, 1 ns, and the nodes are 3 m apart, the signal would have to travel at 10 times the speed of light to return the most recent value.

Fortunately, strict consistency is rarely necessary.
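The speed-of-light argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name and the constant are introduced here for the example, while the 3 m distance and the 1 ns window are the figures quoted in the text.

```python
# Back-of-the-envelope check of the speed-of-light argument.
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def required_speed_factor(distance_m: float, window_s: float) -> float:
    """Return how many times the speed of light a signal would need
    to travel in order to cover distance_m within window_s."""
    return (distance_m / window_s) / SPEED_OF_LIGHT


# Figures from the text: nodes 3 m apart, accesses 1 ns apart.
factor = required_speed_factor(distance_m=3.0, window_s=1e-9)
print(f"required speed: about {factor:.1f} times the speed of light")
```

Light covers roughly 0.3 m in a nanosecond, so a 3 m hop inside a 1 ns window needs about a tenfold excess, which is the figure the text quotes.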
To summarize, with strict consistency,

• all writes are instantaneously visible to all processes;
• all subsequent reads see the new value of the write, no matter how soon after the write the read is done, and no matter where the process executing the read is located.

Sequential consistency: Strict consistency isn't really necessary to write parallel programs.

Earlier in this course, we learned that parallel programs shouldn't make any assumptions about the relative speeds of the processes, or how their actions would interleave in time.

Counting on two events within one process to happen so quickly that another process won't have time to do something in between is asking for trouble.

Let us weaken the requirements we impose on the memory so that the resulting model is realizable.

Sequential consistency: The result of any execution is the same as if

• the memory operations of all processors were executed in some sequential order, and
• the operations of each individual processor appear in this sequence in the order specified by its program.

What's the difference between strict consistency and this?

• In sequential consistency, the temporal ordering of events does not matter.
• All that is required is that the processors see the same ordering of memory operations (regardless of whether this is the order in which the operations actually occurred).

So, with sequential consistency we don't have to worry about providing up-to-the-nanosecond results to all of the processors.

The example below shows the difference between strict and sequential consistency. The two sequences of operations are equally valid.

Note that a read from P2 is allowed to return an out-of-date value (because it has not yet seen the previous write).

    P1:  W(x)1
    P2:          R(x)0  R(x)1

    P1:  W(x)1
    P2:          R(x)1  R(x)1

From this we can see that running the same program twice in a row in a system with sequential consistency may not give the same results.
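The definition of sequential consistency can be made concrete with a small brute-force checker. The sketch below is an illustration only, not part of the original notes: the function name and the tuple encoding of operations are assumptions made for this example, every location is assumed to start at 0, and the search is exponential, so it is suitable only for tiny histories like the ones diagrammed above.

```python
def sequentially_consistent(programs, initial=0):
    """Return True if some interleaving that keeps each processor's
    program order makes every read return the latest written value.

    programs: one list per processor of ("W" | "R", variable, value).
    """
    def search(positions, memory):
        # All operations placed: a legal sequential order exists.
        if all(pos == len(prog) for pos, prog in zip(positions, programs)):
            return True
        for i, prog in enumerate(programs):
            pos = positions[i]
            if pos == len(prog):
                continue
            kind, var, value = prog[pos]
            advanced = positions[:i] + (pos + 1,) + positions[i + 1:]
            if kind == "W":
                if search(advanced, {**memory, var: value}):
                    return True
            elif memory.get(var, initial) == value:
                # A read is legal only if it sees the current value.
                if search(advanced, memory):
                    return True
        return False

    return search(tuple(0 for _ in programs), {})


# The two histories from the text: both are sequentially consistent.
history_a = [[("W", "x", 1)], [("R", "x", 0), ("R", "x", 1)]]
history_b = [[("W", "x", 1)], [("R", "x", 1), ("R", "x", 1)]]
# Reading the new value and then the old one has no legal ordering.
history_c = [[("W", "x", 1)], [("R", "x", 1), ("R", "x", 0)]]

for history in (history_a, history_b, history_c):
    print(sequentially_consistent(history))
```

The checker ignores real time entirely, which mirrors the point made above: only the existence of a single agreed ordering matters, not when the operations actually happened.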
While it is possible to implement, sequential consistency has very poor performance. Can you guess why?

Causal consistency: The next step in weakening the consistency constraints is to distinguish between events that are potentially causally connected and those that are not.

Two events are causally related if one can influence the other.

    P1:  W(x)1
    P2:          R(x)1  W(y)2

Here, the write to x could influence the write to y, because ________

On the other hand, without the intervening read, the two writes would not have been causally connected:

    P1:  W(x)1
    P2:          W(y)2

The following pairs of operations are potentially causally related:

• A read followed by a later write.
• A write followed by a later read to the same location.
• The transitive closure of the above two types of pairs of operations.

Operations that are not causally related are said to be concurrent.

Causal consistency: Writes that are potentially causally related must be seen in the same order by all processors. Concurrent writes may be seen in a different order by different processors.

Here is a sequence of events that is allowed with a causally consistent memory, but disallowed by a sequentially consistent memory:

    P1:  W(x)1                 W(x)3
    P2:          R(x)1  W(x)2
    P3:          R(x)1                 R(x)3  R(x)2
    P4:          R(x)1                 R(x)2  R(x)3

Why is this not allowed by sequential consistency?

Why is this allowed by causal consistency?

What is the violation of causal consistency in the sequence below?

    P1:  W(x)1
    P2:          R(x)1  W(x)2
    P3:                         R(x)2  R(x)1
    P4:                         R(x)1  R(x)2

Without the R(x)1 by P2, this sequence would have been legal.

Implementing causal consistency requires the construction of a dependency graph, showing which operations depend on which other operations.

PRAM consistency: Causal consistency requires that all processes see causally related writes from all processors in the same order.
The next step is to relax this requirement, to require only that writes from the same processor be seen in order. This gives pipelined-RAM (PRAM) consistency.

PRAM consistency: Writes performed by a single process are received by all other processors in the order in which they were issued. Writes from different processors may be seen in a different order by different processors.

PRAM consistency is so named because writes can be pipelined; that is, a processor does not have to stall waiting for a write to be completed before starting the next one.

PRAM consistency would permit this sequence, which we saw violated causal consistency:

    P1:  W(x)1
    P2:          R(x)1  W(x)2
    P3:                         R(x)2  R(x)1
    P4:                         R(x)1  R(x)2

Another way of looking at this model is that all writes generated by different processors are considered to be concurrent.

Sometimes PRAM consistency can lead to counterintuitive results.

    P1:                           P2:
    a := 0;                       b := 0;
      :                             :
    a := 1;                       b := 1;
    if b = 0 then kill(p2);       if a = 0 then kill(p1);

At first glance, it seems that no more than one process should be killed. With PRAM consistency, however, it is possible for both to be killed.

Processor consistency: Processor consistency is very similar to PRAM consistency, but it has one additional condition: memory coherence.

Memory coherence requires that writes to the same location be viewed in the same order by all the processors. Writes to different locations need not be seen in the same order by different processors.

Processor consistency = PRAM consistency + memory coherence.

Weak consistency: PRAM and processor consistency are still stronger than necessary for many programs, because they require that writes originating in a single processor be seen in order everywhere.

But it is not always necessary for other processors to see writes in order, or even to see all writes, for that matter.
Suppose a processor is in a tight loop in a critical section, reading and writing variables.

Other processes aren't supposed to touch these variables until the process exits its critical section.

Under PRAM consistency, the memory has no way of knowing that other processes don't care about these writes, so it has to propagate all writes to all other processors in the normal way.

To relax our consistency model further, we have to divide memory operations into two classes and treat them differently.

• Accesses to synchronization variables are sequentially consistent.
• Accesses to other memory locations can be treated as concurrent.

This strategy is known as weak consistency.

With weak consistency, we don't need to propagate accesses that occur during a critical section. We can just wait until the process exits its critical section, and then

• make sure that the results are propagated throughout the system, and
• stop other actions from taking place until this has happened.

Similarly, when we want to enter a critical section, we need to make sure that all previous writes have finished.

These constraints yield the following definition:

Weak consistency: A memory system exhibits weak consistency iff

1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable can be performed until all previous writes have completed everywhere.
3. No data access (read or write) can be performed until all previous accesses to synchronization variables have been performed.

Thus, by doing a synchronization before reading shared data, a process can be assured of getting the most recent values.

Note that this model does not allow more than one critical section to execute at a time, even if the critical sections involve disjoint sets of variables.

This model puts a greater burden on the programmer, who must decide which variables are synchronization variables.
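The weak-consistency rules above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real DSM system: the class and method names are hypothetical, every processor keeps a full local copy of memory, locations start at 0, and the model picks the laziest legal behavior, in which ordinary writes stay local until the writer synchronizes. A real memory is free to propagate writes earlier and in any order.

```python
class WeakMemory:
    """Toy weakly consistent memory (hypothetical model)."""

    def __init__(self, n_processors):
        # One private copy of memory per processor.
        self.local = [dict() for _ in range(n_processors)]
        # Writes each processor has made but not yet propagated.
        self.pending = [dict() for _ in range(n_processors)]

    def write(self, proc, var, value):
        # Ordinary data access: update the local copy only.
        self.local[proc][var] = value
        self.pending[proc][var] = value

    def read(self, proc, var):
        # Ordinary data access: may return an out-of-date value.
        return self.local[proc].get(var, 0)

    def sync(self, proc):
        # Rule 2: the synchronization is not performed until all of
        # this processor's earlier writes have completed everywhere.
        for var, value in self.pending[proc].items():
            for copy in self.local:
                copy[var] = value
        self.pending[proc].clear()


memory = WeakMemory(n_processors=2)
memory.write(0, "x", 1)
memory.write(0, "x", 2)
stale = memory.read(1, "x")   # legal: processor 0 has not synchronized
memory.sync(0)                # processor 0 leaves its critical section
memory.sync(1)                # processor 1 synchronizes before reading
fresh = memory.read(1, "x")
print(stale, fresh)
```

In this model only the final value of each variable crosses the network at a synchronization point, which is the saving that motivates weak consistency: the intermediate writes made inside the critical section are never propagated.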
Weak consistency says that memory does not have to be kept up to date between synchronization operations.

This is similar to how a compiler can put variables in registers for efficiency's sake. Memory is only up to date when these variables are written back.

If there were any possibility that another process would want to read these variables, they couldn't be kept in registers.

This shows that processes can live with out-of-date values, provided that they know when to access them and when not to.

The following is a legal sequence under weak consistency. Can you explain why?

    P1:  W(x)1  W(x)2  S
    P2:                    R(x)1  R(x)2  S
    P3:                    R(x)2  R(x)1  S

Here's a sequence that's illegal under weak consistency. Why?

    P1:  W(x)1  W(x)2  S
    P2:                    S  R(x)1

Release consistency: Weak consistency does not distinguish between entry to a critical section and exit from it.

Thus, on both occasions, it has to take the actions appropriate to both:

• making sure that all locally initiated writes have been propagated to all other memories, and
• making sure that the local processor has seen all previous writes anywhere in the system.

If the memory could tell the difference between entry and exit of a critical section, it would only need to satisfy one of these conditions.

Release consistency provides two operations:

• acquire operations tell the memory system that a critical section is about to be entered.
• release operations say a critical section (c.s.) has just been exited.

It is possible to acquire or release a single synchronization variable, so more than one c.s. can be in progress at a time.

When an acquire occurs, the memory will make sure that all the local copies of shared variables are brought up to date.

When a release is done, the shared variables that have been changed are propagated out to the other processors.

But doing an acquire does not guarantee that locally made changes will be propagated out immediately.
Similarly, doing a release does not necessarily import changes from other processors.

Here is an example of a valid event sequence for release consistency (A stands for acquire, and Q for release or quit):

    P1:  A(L)  W(x)1  W(x)2  Q(L)
    P2:                             A(L)  R(x)2  Q(L)
    P3:                                                  R(x)1

Note that since P3 has not synchronized, it does not necessarily get the new value of x.

Release consistency: A system is release consistent if it obeys these rules:

1. Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed.
2. Before a release is allowed to be performed, all previous reads and writes done by the process must have completed.
3. The acquire and release accesses must be processor consistent.

If these conditions are met, and processes use acquire and release properly, the results of an execution will be the same as on a sequentially consistent memory.

Summary: Strict consistency is impossible. Sequential consistency is possible, but costly. The model can be relaxed in various ways.

Consistency models not using synchronization operations:

    Type of consistency   Description
    Strict                All processes see absolute time ordering of all shared accesses.
    Sequential            All processes see all shared accesses in same order.
    Causal                All processes see all causally related shared accesses in the same order.
    Processor             PRAM consistency + memory coherence.
    PRAM                  All processes see writes from each processor in the order they were
                          initiated. Writes from different processors may not be seen in the
                          same order.

Consistency models using synchronization operations:

    Type of consistency   Description
    Weak                  Shared data can only be counted on to be consistent after a
                          synchronization is done.
    Release               Shared data are made consistent when a critical region is exited.

The following diagram contrasts various forms of consistency.
[Figure: four panels (sequential, processor, weak, and release consistency) showing the permitted ordering of reads (R), writes (W), groups of memory operations {M, M}, and SYNCH, ACQUIRE, and RELEASE operations.]

In summary,

• In sequential consistency, R and W follow program order.
• In PRAM or processor consistency, R may precede buffered W; other processors may see a different order of {W, R} accesses.
• In weak consistency, SYNCH operations are sequentially consistent; other memory operations can occur in any order.
• In release consistency, SYNCH operations are split into ACQUIRE (lock) and RELEASE (unlock), and these operations are processor consistent.

Outline for Lecture 21
I. NUMA vs. NORMA
II. Replication of memory
    A. Granularity of chunks
    B. False sharing
III. Achieving sequential consistency
    A. Update vs. invalidate
    B. A 12-state protocol
    C. Finding the owner
    D. Locating copies of pages
IV. Synchronization in DSM systems

Page-based distributed shared memory: So far, we have not differentiated between DSM systems built on multiprocessors and on networks.

In both types of systems, memory consistency is needed, and we can use inheritance to decide whether to share or copy pages when a new process is created.

However, there are differences, due to

• latency of interprocessor communication, and
• whether address-mapping hardware is available to translate remote references.

Thus, we will differentiate between

• NUMA systems, which have hardware for accessing remote memory, and
• NORMA (no remote memory access) systems, where remote memory accesses must be performed totally in software.

When a NUMA system references a remote page, it may

• fetch the page, or
• use it remotely, with the MMU mapping each access.

On a NORMA system, the MMU can't address remote memory, so the page must be fetched.
Small (bus-based) multiprocessors provide sequential consistency, and early NORMA DSM systems tried to replicate this, so that programs would not have to be rewritten.

But it was soon realized that major performance gains could be achieved by relaxing the consistency model. Thus, programs had to be rewritten.

Replication of memory: In a DSM system, when a processor references an address that is not local,

• a trap occurs, and
• the DSM software fetches the chunk of memory containing the address, and
• restarts the faulting instruction, which can now complete successfully.

For example, in the diagram below, if processor 0 references instructions or data in chunks 0 or 2, the references are done locally (chunk may or may not = page).

References to other chunks cause ________

[Figure: a shared address space of 16 chunks (0 to 15) distributed among the memories of CPU 0 through CPU 3; chunks 0 and 2 reside on CPU 0.]

For example, a reference to chunk 10 will cause a trap to the DSM software, which then moves chunk ____ from machine ____ to machine ____.

[Figure: the same system after chunk 10 has been moved to CPU 0.]

However, suppose that page 10 contains code. After the move, if processor 3 is running this program, the first reference to page 10 will cause a fault. Page 10 will ping-pong back and forth.

Hence, it is a good idea to allow copies of pages to be made, as in Mach:

[Figure: the same system with chunk 10 replicated, so that a copy is present on CPU 0 as well as on another machine.]

Another possibility is to replicate all chunks. As long as a chunk is being shared, multiple copies can exist. When a shared chunk is first written, special action must be taken to prevent inconsistency.

Granularity of chunks: So far, we have not specified how large a chunk is.

On a multiprocessor, it may be quite small. The MMU of the requesting processor knows what word is being fetched, and can immediately request it over the interconnection network within microseconds or nanoseconds.

So usually the unit of transfer is a cache line (32 to 128 bytes).
But in NORMA systems, the overhead of transfer is much greater.

This argues for a ________ chunk size, so that transfers don't need to be made so ________.

One possibility is to transfer a page at a time. This makes it simpler to integrate DSM with virtual memory, since the unit of transfer (page) is the same for both.

But it may be even better to transfer a larger chunk, say 2, 4, or 8 pages.

• Advantage:
• Disadvantages:

A large chunk size exacerbates the problem of false sharing. Suppose that

• processor 1 is making heavy use of a variable a.
• processor 2 is making heavy use of a variable b.

Variables a and b are not on the same page, but they are in the same 2-page chunk.

[Figure: the two-page chunk containing a and b, as seen by CPU 1 and CPU 2.]

What happens?

• If 1 chunk = 1 page,
• If 1 chunk > 1 page,

Clever compilers that understand which variables are shared can place variables in different chunks and help avoid false sharing.

However, this is not always possible, especially if a and b are array elements in close proximity.

Achieving sequential consistency: Whenever pages are copied, e.g., for performance reasons, the copies must be kept consistent.

If the pages are read-only, no special action need be taken. An ordinary entry is made in the page table of each process using the page, and the MMU is used as normal to perform address translation.

If the copies are writable, however, the first write to the page must be trapped (treated like a page fault).

The operating system then gains control. It has two choices.

• Update the other copies of the page. This involves sending the address of the modified word and its new value to all other processors that might have copies. The other processors then update their local copy of the page.
• Invalidate the other copies. This involves sending the address (only) to all of the other processors.
The other processors mark the page as not present, and on their next reference to that page, a page fault occurs. Hence, only one copy of the page remains.

NUMA multiprocessors can use either of these two strategies (usually on cache lines, rather than whole pages; this is covered in CSC/ECE 506).

However, in a NORMA DSM system, one of these strategies is not feasible. Can you guess which?

The problem is that in a NORMA system, without special hardware, the only way to mark a reference for special handling (e.g., invalidation or update) is to mark the page as read-only or not present.

Then, when the page is written, a trap will invoke the operating system, which can take care of notifying the other processors.

With a protocol based on invalidation, the OS needs to be invoked only the first time the page is written. There is some overhead to this, but in general, it is feasible.

With a protocol based on ________, the OS needs to be invoked each time the page is written. This causes a context swap at every write reference, and is obviously prohibitive.

Besides, there is a consistency problem. Do you see what it is?

An invalidation protocol for page-based DSM: At any time, each page is either in state

• R (readable), or
• W (readable and writable).

The state may change as execution progresses.

Each page has an owner, which is the process that most recently wrote the page.

• If a page is in state R, the owner has a copy.
• If a page is in state W, only one copy exists (the owner's).

There are twelve possible transitions, depending on whether

• a process, say p0, reads or writes;
• p0's processor is the owner of the page;
• the state of the page is
    R, with a single outstanding copy,
    W, with a single outstanding copy, or
    R, with multiple copies.

In each case below, p0 runs on Processor 0.

Case 1. Processor 0 is the owner and holds the page in state W; Processor 1 has no copy.
    Action when p0 reads:  1. Do read
    Action when p0 writes: 1. Do write
The simplest case is when p0 is running on the owning processor, and the page is held with write access. In this case, either a read or a write can proceed immediately.

Case 2. Processor 0 is the owner and holds the page in state R; Processor 1 has no copy.
    Action when p0 reads:  1. Do read
    Action when p0 writes: 1. Mark page as ____  2. Do write
If p0 is running on the owning processor and the page is held with read access, a read proceeds immediately, but a write requires a change of state.

Case 3. Processor 0 is the owner and holds the page in state R; Processor 1 also holds a copy in state R.
    Action when p0 reads:  1. Do read
    Action when p0 writes: 1. ________  2. Mark page as W  3. Do write
If p0 is running on the owning processor and there are multiple read copies, a read proceeds immediately, but a write requires that the other copies ________

Case 4. Processor 0 holds a copy in state R; Processor 1 is the owner and holds the page in state R.
    Action when p0 reads:  1. Do read
    Action when p0 writes: 1. Ask for ________  2. Ask for ownership  3. Mark page as W  4. Do write
If p0 is running on a non-owning processor, a read still proceeds immediately, but a write requires that the requesting processor gain ownership.

Case 5. Processor 0 has no copy; Processor 1 is the owner and holds the page in state R.
    Action when p0 reads:  1. Ask for ________  2. Mark page as R  3. Do read
    Action when p0 writes: 1. Ask for invalidation  2. Ask for ownership  3. Ask for page  4. Mark page as W  5. Do write
With the situation the same as above, but no local copy,

• a write proceeds as before, except that the page needs to be ________.
• a read requires the page to be ________ first.

Case 6. Processor 0 has no copy; Processor 1 is the owner and holds the page in state W.
    Action when p0 reads:  1. Ask for degradation  2. Ask for copy  3. Mark page as R  4. Do read
    Action when p0 writes: 1. Ask for invalidation  2. Ask for ownership  3. Ask for page  4. Mark page as W  5. Do write
With the situation the same as before, but a non-local page in state W,

• a write proceeds as in the previous case.
• a read requires the access mode to be degraded to R.

Note that the other processor can still keep a copy of the page.

Which of these cases require traps to the operating system?

Note that in all cases, before a write is performed, the protocol guarantees that only a single copy of the page exists.

Thus, ________ is maintained.
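The invalidation protocol described above can be sketched as a small state machine. This is a hedged illustration rather than the protocol as any particular system implements it: the class name, the action strings, and the bookkeeping of which processors hold copies are all assumptions made for the example, and messages between processors are reduced to entries in a returned list.

```python
class DsmPage:
    """One page under an invalidation protocol (hypothetical model)."""

    def __init__(self, owner):
        self.owner = owner       # processor that most recently wrote
        self.state = "W"         # "R" = readable, "W" = read/write
        self.copies = {owner}    # processors currently holding a copy

    def read(self, proc):
        actions = []
        if proc not in self.copies:
            if self.state == "W":
                # The single writable copy must drop to read-only.
                actions.append("ask owner to degrade W to R")
                self.state = "R"
            actions.append("ask owner for a copy")
            actions.append("mark page R")
            self.copies.add(proc)
        actions.append("do read")
        return actions

    def write(self, proc):
        if self.owner == proc and self.state == "W":
            return ["do write"]  # simplest case: no protocol work
        actions = []
        if self.copies - {proc}:
            actions.append("invalidate other copies")
        if self.owner != proc:
            actions.append("take ownership")
            if proc not in self.copies:
                actions.append("fetch page")
        # After these steps exactly one copy exists, held by the writer.
        self.copies, self.owner, self.state = {proc}, proc, "W"
        actions += ["mark page W", "do write"]
        return actions

    def invariant_holds(self):
        # In state W only the owner's copy may exist.
        return self.state == "R" or self.copies == {self.owner}


page = DsmPage(owner=1)
read_steps = page.read(0)     # non-owner, no local copy, page in W
write_steps = page.write(0)   # non-owner with a read copy
again = page.write(0)         # now the owner, page in W
print(read_steps, write_steps, again, page.invariant_holds())
```

The `invariant_holds` check corresponds to the closing observation above: every write path ends with a single copy of the page.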
Finding the owner: The replication protocol requires that the owner be found whenever ________ or ________.

It is not practical to broadcast these requests to every processor. Why?

For this reason, one process may be designated as the page manager.

The page manager keeps track of which processor owns each page.

When a process p0 wants to contact the owner of a page,

• it sends a message to the page manager asking ________.
• The page manager replies.
• p0 sends a message to the owner, which performs the action and confirms.

Four messages are required, as shown below.

[Figure: 1. Request (p0 to page manager); 2. Reply (page manager to p0); 3. Request (p0 to owner); 4. Reply (owner to p0).]

An obvious optimization is to have the page manager forward the request to the owning processor. This reduces the number of messages to three.

[Figure: 1. Request (p0 to page manager); 2. Forward (page manager to owner); 3. Reply (owner to p0).]

Still, the performance of this scheme could be bad. Why?

The solution is to provide ________.

This introduces a problem: finding the right page manager.

Suppose we use a few bits of the page number to identify the page manager: the three least-significant bits, for example.

Then

• page manager 0 would handle all pages whose number ends with 000,
• page manager 1 would handle all pages whose number ends with 001, etc.

Why use the least-significant bits, rather than, e.g., the most-significant bits?

Finding copies of pages: How can copies be found when they need to be invalidated?

The first way is to broadcast a message and rely upon all of the processors to look at it and invalidate the page if they hold a copy.

However, this is not only slow, but unreliable if a message can be lost.

A better way is to have each page manager keep a list of which processors have copies of each page it manages.
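The two ideas just described, choosing a page manager from the low-order bits of the page number and having each manager record which processors hold copies, can be sketched together. The names below (`manager_for`, `PageManager`, `record_copy`, `holders`) are hypothetical and exist only for this illustration; the three-bit width follows the example in the text.

```python
from collections import defaultdict

MANAGER_BITS = 3  # the text's example: three least-significant bits


def manager_for(page_number: int) -> int:
    """Pick the page manager from the low-order bits of the page number."""
    return page_number & ((1 << MANAGER_BITS) - 1)


class PageManager:
    """Bookkeeping kept by one page manager (hypothetical structure)."""

    def __init__(self):
        self.owner = {}                  # page -> owning processor
        self.copyset = defaultdict(set)  # page -> processors with a copy


managers = [PageManager() for _ in range(1 << MANAGER_BITS)]


def record_copy(page_number: int, processor: int) -> None:
    """Note that processor now holds a copy of the page."""
    managers[manager_for(page_number)].copyset[page_number].add(processor)


def holders(page_number: int) -> list:
    """Processors that would need an invalidation message for the page."""
    return sorted(managers[manager_for(page_number)].copyset[page_number])


record_copy(10, processor=0)
record_copy(10, processor=3)
record_copy(18, processor=2)
print(manager_for(10), holders(10))
print([manager_for(p) for p in range(8, 16)])
```

With this bookkeeping, an invalidation needs messages only to the processors in the page's copy set, rather than a broadcast to every processor.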
When a page must be invalidated, the old owner, new owner, or page manager sends a message to each processor holding the page, and waits for an ACK.

When the message has been acknowledged, the invalidation is complete.

Synchronization: Recall the test_and_set instruction from Lecture 5. It returns the value of its argument, and sets the variable specified by the argument to true, in a single atomic operation.

In a DSM system, code that uses test_and_set to guard a critical section works fine as long as only one process at a time is competing for entry to the critical section.

If p0 is inside the critical section, and p1 (on a different machine) wants to enter,

• p1 ________
• The page containing the variable remains on p1's machine.
• When p0 exits the critical section, it pulls over this page and sets the variable to false.
• p1 immediately encounters a page fault and pulls the page back.

So, the exit and entry have been performed with just two page faults.

However, the performance can be much worse than this. How?

Not only is this slow for the processors, but it creates enormous network traffic.

Therefore, a DSM system often provides an additional mechanism for synchronization.

One or more synchronization managers can accept messages asking to enter and leave critical sections (or set and reset locks), and send a reply back when the requested action has been accomplished.

If the critical section can't be entered immediately, a reply is deferred, causing the requester to block. When the critical section becomes available, a message is sent back.

This accomplishes synchronization with a minimum of network traffic. We will take this up in more detail when we consider distributed process management in Lecture 25.

Managing the physical address space: So far, we've been studying the virtual address space: what's in it, how it's addressed, when it's moved.

Now we want to concentrate more on physical memory. Techniques depend upon the architecture.
• Paged virtual memory: the main concern is when to bring a page of data into main memory, and which page to remove.
• Segmented virtual memory, or multiprogrammed non-VM: management of free memory is also a major concern.

Outline for Lecture 22
I. Management of available storage
   A. Chained pool
   B. Buddy system
   C. The 50% rule
II. Swapping & paging
   A. Handling a page fault
   B. The page table
   C. The frame table
   D. The TLB
   E. Locked-in pages
III. Inverted page tables

Management of available physical memory: We will study two methods: a chained pool of free storage, and the buddy system.

With either method, managing free storage consists of three tasks:
1. Finding a hole large enough for a segment of a given size.
2. Updating the list of holes when a segment is allocated.
3. Updating the list of holes when a segment is deallocated.

Chained Pool: An available-space list (list of holes) is used. It has these characteristics:
• Ordered by address. This aids in coalescing holes when a segment is deallocated.
• Consists of nodes, each of which begins with a header composed of a size field and a pointer to the next node on the list.

[Figure: an available-space list; each node holds a size field and a pointer to the next node.]

• A dummy node of length 0 resides permanently at the beginning of the list.
• For allocated segments, we will assume that the size of the segment is obtained from the segment table, not from a field within the segment. (Bad for robustness.)

When memory is to be allocated, the list may be searched by one of several methods:
1. First fit. Allocate the first hole on the list that is big enough.
2. Best fit. Allocate the hole which most nearly matches the requested size.
3. Modified best fit. Best fit, except that either small leftover holes of size at most e are included in the allocated block, or all blocks allocated are required to be multiples of a certain size e.
4. Next fit. First fit that remembers where it left off.
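The search methods listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the hole list is a plain Python list of (address, size) pairs ordered by address rather than a linked list with headers, the sample holes are made up, and the function names are hypothetical.

```python
def first_fit(holes, size):
    """Index of the first hole that is big enough, or None."""
    for i, (_addr, hole_size) in enumerate(holes):
        if hole_size >= size:
            return i
    return None


def best_fit(holes, size):
    """Index of the smallest hole that is big enough, or None."""
    candidates = [(hole_size, i)
                  for i, (_addr, hole_size) in enumerate(holes)
                  if hole_size >= size]
    return min(candidates)[1] if candidates else None


def next_fit(holes, size, start):
    """First fit that begins at index `start` (where the last search left off)."""
    n = len(holes)
    for step in range(n):
        i = (start + step) % n  # wrap around to the front of the list
        if holes[i][1] >= size:
            return i
    return None


# Made-up example list: (address, size), ordered by address.
holes = [(0, 14), (32, 12), (60, 24), (100, 18)]
```

For a request of size 12, first fit picks the hole at address 0 while best fit picks the exact match at address 32; next fit started at index 2 picks the hole at address 60.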
The search algorithm for next fit
• Needs to delete a hole if its size is exactly the size of the allocated segment.

[Figure: the list before and after allocation, for an exact-size hole and for a non-exact-size hole.]

• Keeps a variable trail which points to where the search left off last time.

The space-allocation algorithm:
• If the hole is the same size as the segment, the hole is deleted.
• Otherwise, space is taken from the end of the hole so that links don't have to be changed.

The space-deallocation algorithm: Note that freed space might have to be merged with the holes preceding and/or following it.

[Figure: four cases for the block to be freed: a hole on neither side, a hole below, a hole above, holes on both sides.]

Note that this method requires searching the free list on both allocation and deallocation.

Another method, using boundary tags, avoids the need to search at deallocation by using
• a doubly linked but unordered available-space list, and
• in both the first and last word of each block, size fields and boolean boundary tags, which tell whether the block is allocated or not.

[Figure: layout of an allocated block (tag and size at both ends) and an unallocated block (tag, size, and two links).]

Compaction (S&G, 8.4.3): After the system executes for a while using first-fit, next-fit, or a related algorithm, memory is likely to be fragmented, as shown on the right (S&G, Fig. 8.11).

Now if a request for 500K is received, it cannot be honored, despite the fact that a total of 900K is free.

One solution is to move all of the allocated blocks to one end of memory, as shown below, left.

[Figure (S&G, Fig. 8.11): four memory maps of a 2100K memory holding the OS and P1 through P4: the original allocation and three compacted layouts.]

How many words need to be moved in this case?
A more intelligent approach would move fewer words to achieve the same result, as shown above, middle. How many words are moved here?

If we are willing to allow the hole to be somewhere other than one end of memory, we may reduce the copying requirement still further. How many words need to be moved in the diagram on the right?

Storage compaction is expensive.
• It should be used only when no hole is large enough to satisfy a request.
• It is usually cheaper to swap segments.

Storage compaction is sometimes impossible. Why? Hint: Consider address relocation.

The buddy system: An alternative to the chained-pool methods.
• Block sizes are restricted to powers of two.
• A separate free list is used for each size of block.

[Figure: free lists, one per block size (2^8, 2^10, 2^11, 2^12).]

• Each address that ends in k zeros is considered to be the starting address of a block of length 2^k.
• If a block of size 2^k is requested and none is available, list k+1 is checked for a block of size 2^(k+1).
• If one is found, it is split into two buddies, each of size 2^k. One of the buddies is allocated, and the other is placed on the size-2^k free list.
• If none is found, the next larger size block (size 2^(k+2)) is sought, and so on.

When a block of size 2^k is deallocated, a test is made to see whether its buddy is free. If so, the two blocks are recombined into one of the next larger size (2^(k+1)), which in turn is recombined with its buddy, if available, and so forth.

The splitting method guarantees that the starting addresses of two buddy blocks of size 2^k differ only in the (k+1)st bit from the right.

                                k bits of zeros
Addr(B)          1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
Addr(Buddy(B))   1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0

Let Addr(B) represent the address of a block of size 2^k. Then what is the address of B's buddy?
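One way to express the relationship above, sketched in Python: since buddies of size 2^k differ only in the (k+1)st bit from the right, flipping that bit with an exclusive-or yields the buddy's address. The function name buddy_address is hypothetical, and the alignment check is an added assumption that blocks of size 2^k always start at a multiple of 2^k.

```python
def buddy_address(addr: int, k: int) -> int:
    """Address of the buddy of the size-2**k block that starts at addr."""
    if addr % (1 << k) != 0:
        raise ValueError("a block of size 2**k must start at a multiple of 2**k")
    return addr ^ (1 << k)  # flip the (k+1)st bit from the right


# The 18-bit example from the diagram above, with k = 10.
b = 0b101110110000000000
buddy = buddy_address(b, 10)
```

Applying the function twice returns the original address, which matches the fact that each block is its buddy's buddy.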
The fifty-percent rule: Regardless of what storage-allocation algorithm is being used, at equilibrium, there will be half as many holes as allocated blocks.

Consider an allocated block in memory. Half of the operations on the block above it will be allocations, and half will be deallocations. Thus, half the time it has another allocated block as an upper neighbor, and half the time it has a hole for an upper neighbor.

Therefore, if the mean number of allocated blocks is n, the mean number of holes is n/2.

Swapping and paging: Information may be transferred into main memory according to one of two policies.
• Demand paging (or swapping). A page is not swapped in until it is referenced by a process. A page fault then occurs, and the process must wait.
• Prepaging (pre-swapping). The program requests transfers of pages some time before it needs to use them. Useful in closed (e.g., some real-time) systems.

A page fault may occur while
• fetching an instruction, or
• fetching or storing data (an operand of an instruction).

Steps in handling a page fault:
1. On a TLB miss, check the page-table entry for the referenced page. If this is a page fault, the presence bit will be off.
2. Trap to the operating system.
3. Select a free page frame (or a victim, if all frames are occupied). Write out the victim if it has changed. Determine the location of the desired page on secondary storage. Block the faulting process.
4. Issue a read request from secondary storage to the free frame.
5. When the read operation is complete, modify the page table and frame table (described below).
6. Restart the instruction that caused the page fault. (The instruction must be re-fetched, then executed.)

The page table contains this information in each entry:

[ Page-frame number or secondary-storage address | Presence bit | Valid bit | Access rights ]

The presence bit tells whether the page is in main memory.

The valid bit tells whether this page exists.
If the referenced address is too high, an exception will be generated.

The access rights tell what kinds of access the process has to this page.

The page-frame number and secondary-storage address can be stored in the same field, as explained below.

The frame table is a companion to the page tables.

[ Page-descriptor address | Mode | Secondary-storage address ]

• It tells how to find the page descriptor when it's time to swap out a page.
• The secondary-storage field allows page tables to be made more compact. How?
• The mode field has four values: free, in transition, in use, locked in.
• Given a physical frame number, it is easy to derive the location of the page descriptor. How?

Each entry in the TLB contains six fields:

[ PID | Page # | Use | Dirty | Access rts. | Page-frame # ]   (PID and Page # together form the key)

• Process ID. Identifies the process.
• Key. The page number.
• Use bit. Records whether page has been used recently.
• Dirty bit. Records whether page has been written into recently.
• Access rights.
• Page frame.

Locked-in pages (S&G, 9.8.5): Why is it necessary to lock some pages in memory?

For one thing, e.g., ____.

Consider I/O. I/O devices usually write to and read from physical addresses, not virtual addresses. What problem does this pose?

There are two common solutions to this problem.
• Always allocate I/O buffers in the operating system, not in user space. Disadvantage?
• Lock in pages that have I/O requests in progress.

Another use of locked-in pages is to prevent a low-priority process's pages from being reclaimed before they are ever used.

Inverted page tables (S&G 9.8.3): We have assumed that a system maintains both a page table and a frame table.

The frame table holds descriptor addresses. Then from each frame-table entry, the corresponding page-table entry can be found.

Suppose instead of having a frame-table entry point at a page-table entry, the frame-table entry included all the information in a page-table entry.
Then we wouldn't have to keep the page table in main memory at all!

What would be the advantage of this?

As a process's address space grows large, it also tends to be sparsely populated.

When the frame table is used as a page table, it is called an inverted page table.

Systems using an inverted page table include
• IBM System/38 and AS/400
• IBM RT
• IBM RS/6000
• HP Spectrum workstations.

Here is a simplified description of how the inverted page table is used in the RT.

A virtual address always has its process number prepended:

<process-id, page #, displacement>

If there is a TLB hit, translation proceeds just as in a system with an ordinary page table.

If there is a TLB miss, the <process-id, page #> pair is looked up in the inverted page table.

[Figure: translation of <pid, p, d>. On a TLB hit the TLB supplies the physical address. On a TLB miss the hash table and inverted page table supply it; if the pair is not found, the (regular) page table is brought in.]

But the inverted page table is indexed by physical address, not virtual address. So in the worst case, it might be necessary to search all the way through the table.

Instead, a hash table is used to map each <process-id, page #> pair to an inverted-page-table entry.

But whenever there is a page fault, the <process-id, page #> pair will not be in the inverted page table.

Then (and only then) the (regular) page table must be brought into main memory. Since the regular page table is paged, only a single page needs to be brought in.

• Downside: A single page fault may now cause another page fault.
• Upside: It is not necessary to keep pages of the page table in memory just in case of a TLB miss. They are now needed only when a page fault occurs.

Single-virtual-address-space systems: Inverted page tables are especially useful on systems where all processes share a single large address space.

In such systems, it is much easier to share pages and other objects.

But the address space will be used very sparsely by a single process.
Hence, keeping regular page tables in main memory would be a waste.

Outline for Lecture 23
I. Performance of demand paging
II. Replacement strategies
   A. FIFO
   B. OPT
   C. LRU
III. Stack algorithms & priority lists
IV. Approximations to LRU replacement
   A. Clock
   B. Second chance
V. Page vs. segment replacement
VI. Page replacement in DSM systems

Performance of demand paging: Let
a = memory access time (usually 50–500 ns),
p = probability that a memory reference causes a page fault, and
s = time to service a page fault.

Then the effective access time is

e = (1 - p) a + p s

Major components of a page-fault service time are
• Time to service the page-fault interrupt.
• Time to swap in the desired page.
• Time to restart the faulting process.

Assume a = 100 ns, s = 1 ms.

p = 10^-3  gives  e = 0.999 × 100 ns + 0.001 × 1,000,000 ns ≈ 1100 ns.

To achieve less than 10% degradation, we need e < 110 ns, i.e., p < 10 / 999,900 ≈ 10^-5.

Replacement strategies: A replacement strategy answers the question, "When a page (or segment) must be brought in, and there is not enough free memory to do so, which page should be removed?"

The goal of a replacement strategy is to minimize the fault rate. To evaluate a replacement algorithm, we need to specify
• the size of a page,
• a set of reference strings (or "page traces"), and
• the number of page frames.

Given these inputs, we can determine the number of page faults, and hence the page-fault rate.

A page trace (or reference string) is a list of pages in order of their reference by the processor. For example, the following is a page trace:

a b c d a b e a b c d e

Given a particular size memory, we can simulate which pages would be in memory for any replacement algorithm.

FIFO: Remove the page that has been in memory the longest. This requires implementation of a FIFO queue in which pages are ordered by arrival time.

This is not always a good algorithm, because a page that is only used once remains in memory as long as a page that is used heavily.
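The kind of simulation described above is easy to sketch. The following Python fragment counts FIFO page faults for the page trace a b c d a b e a b c d e with one to five page frames; the function name fifo_faults is hypothetical, and memory is assumed to start empty.

```python
from collections import deque


def fifo_faults(trace, frames):
    """Count page faults for FIFO replacement with the given number of frames."""
    memory = deque()  # oldest page at the left
    faults = 0
    for page in trace:
        if page in memory:
            continue              # hit: arrival order is unchanged
        faults += 1
        if len(memory) == frames:
            memory.popleft()      # evict the page that has been resident longest
        memory.append(page)
    return faults


trace = "abcdabeabcde"
counts = [fifo_faults(trace, n) for n in range(1, 6)]
print(counts)  # [12, 12, 9, 10, 5]
```

Note that the count for four frames is higher than the count for three frames, so adding memory does not always reduce the number of faults under FIFO.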
FIFO also suffers from an anomaly (Belady's anomaly): sometimes increasing the size of memory increases the fault rate.

Example: Suppose the algorithm is FIFO and there is only one page frame. The letters above the line give the page trace, and the letters below the line indicate which pages are in memory. An asterisk indicates a page fault.

In a one-page memory: 12 faults.

a* b* c* d* a* b* e* a* b* c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e

Indeed, since there is only one page frame, and the same page is never referenced twice in a row, each reference produces a fault.

In a two-page memory: 12 faults.

a* b* c* d* a* b* e* a* b* c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  b  c  d  a  b  e  a  b  c  d

In a three-page memory: 9 faults.

a* b* c* d* a* b* e* a  b  c* d* e
-----------------------------------
a  b  c  d  a  b  e  e  e  c  d  d
   a  b  c  d  a  b  b  b  e  c  c
      a  b  c  d  a  a  a  b  e  e

In a four-page memory: 10 faults.

a* b* c* d* a  b  e* a* b* c* d* e*
-----------------------------------
a  b  c  d  d  d  e  a  b  c  d  e
   a  b  c  c  c  d  e  a  b  c  d
      a  b  b  b  c  d  e  a  b  c
         a  a  a  b  c  d  e  a  b

In a five-page memory: 5 faults.

a* b* c* d* a  b  e* a  b  c  d  e
-----------------------------------
a  b  c  d  d  d  e  e  e  e  e  e
   a  b  c  c  c  d  d  d  d  d  d
      a  b  b  b  c  c  c  c  c  c
         a  a  a  b  b  b  b  b  b
                  a  a  a  a  a  a

OPT replacement. Replace the page that will never be needed again, or at least will be needed furthest in the future. Priority lists are used throughout. Notice that pages in the lower right have the same priority.

In a one-page memory: 12 faults.

a* b* c* d* a* b* e* a* b* c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e

In a two-page memory: 9 faults.

a* b* c* d* a  b* e* a  b* c* d* e
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  a  a  d  a  a  e  e  e  e  d

In a three-page memory: 7 faults.

a* b* c* d* a  b  e* a  b  c* d* e
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  a  a  b  a  a  b  e  e  e  d
      b  b  d  d  b  e  a  b  c  c

In a four-page memory: 6 faults.

a* b* c* d* a  b  e* a  b  c  d* e
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  a  a  b  a  a  b  c  e  e  d
      b  b  c  c  b  c  e  b  c  c
         c  d  d  c  e  a  a  b  b

In a five-page memory: 5 faults.
a* b* c* d* a  b  e* a  b  c  d  e
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  a  a  b  a  a  b  c  d  e  d
      b  b  c  c  b  c  d  e  c  c
         c  d  d  c  d  e  b  b  b
                  d  e  a  a  a  a

Though unrealizable, the OPT algorithm provides a standard to measure other replacement algorithms against.

A replacement algorithm satisfies the inclusion property if the set of pages in a k-page memory is always a subset of the pages in a (k+1)-page memory.

The OPT algorithm satisfies the inclusion property.

The LRU algorithm. Replace the page that was least recently used. Based on the assumption that the recent past can be used to predict the near future.

The LRU algorithm satisfies the inclusion property. Here is an example of its performance on our page trace:

In a one-page memory: 12 faults.

a* b* c* d* a* b* e* a* b* c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e

In a two-page memory: 12 faults.

a* b* c* d* a* b* e* a* b* c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  b  c  d  a  b  e  a  b  c  d

In a three-page memory: 10 faults.

a* b* c* d* a* b* e* a  b  c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  b  c  d  a  b  e  a  b  c  d
      a  b  c  d  a  b  e  a  b  c

In a four-page memory: 8 faults.

a* b* c* d* a  b  e* a  b  c* d* e*
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  b  c  d  a  b  e  a  b  c  d
      a  b  c  d  a  b  e  a  b  c
         a  b  c  d  d  d  e  a  b

In a five-page memory: 5 faults.

a* b* c* d* a  b  e* a  b  c  d  e
-----------------------------------
a  b  c  d  a  b  e  a  b  c  d  e
   a  b  c  d  a  b  e  a  b  c  d
      a  b  c  d  a  b  e  a  b  c
         a  b  c  d  d  d  e  a  b
                  c  c  c  d  e  a

Stack algorithms and priority lists: An algorithm that satisfies the inclusion property is known as a stack algorithm. The name comes from the way that the list of pages in memory can be updated after each memory reference.

At each memory reference, the referenced page is either already in memory or it isn't.

If it is in memory at distance i from the top of the stack, the list of referenced pages changes as shown on the left, below. If it isn't in memory, the changes are shown at the right.
[Figure: stack contents p1 ... pn before and after a reference. Left: the referenced page is already in memory (at position i). Right: the referenced page is not in memory; the page leaving the bottom of the stack is the victim.]

When a new page needs to be brought in, the page that gets replaced is defined to have the lowest priority of all pages in memory.

In the LRU algorithm, the page at depth n (where n = # of page frames) is always the one with the lowest priority. For other stack algorithms, this is not always the case. In OPT, for example,
• a page's position on the stack is related to how long in the past it was last referenced, but
• its priority depends on how soon in the future it will be referenced.

This diagram shows the priority lists and stacks for a sample reference string.

[Figure: priority lists and stacks for a sample reference string.]

In any priority list, underlined pages have the same priority.

To update the stack, we must compare the priorities of pages to determine whether they should move downward on the stack. At each comparison, the lower-priority page moves downward, but the higher-priority page does not.

[Figure: stack updating by priority comparison. Left: referenced page in memory. Right: referenced page not in memory.]

Let's consider an example of stack updating from the OPT example above.

Approximations to LRU page-replacement algorithm: The worst problem with LRU is that it requires a lot of hardware assistance.

Since direct support for LRU replacement is expensive, approximations are usually used. Most of these methods employ a used bit for each page.
• When the page is referenced (read or written), the used bit is set to 1 (by hardware).
• The used bits are reset to 0 periodically (e.g., when a page is swapped out).
The clock algorithm is an important LRU approximation. It also uses a dirty bit, which tells whether a page has been written recently. It cycles through the frame table, changing the settings of the used and dirty bits.

When the algorithm reaches the end of the frame table, it goes back to the beginning. Thus, the table can be viewed as circular, like a clock:

[Figure: the frame table drawn as a clock face; each entry shows (used bit, dirty bit), and hand points to the current entry.]

The algorithm proceeds as follows. Hand is a variable whose value is saved from one call to the next.

while not done do
begin
    if used = 1 then
        used := 0            {in the entry pointed to by hand}
    else if dirty = 1 then
    begin
        save := dirty;       {remember that the page must still be written out}
        dirty := 0
    end
    else
    begin
        victim := hand;      {Mark this page for swap out}
        done := true;
    end;
    hand := (hand + 1) mod table_length;   {advance, wrapping at the end of the table}
end;

Notice that
• if the algorithm encounters an in-use page frame with (used bit, dirty bit) = (1,1) or (1,0), it turns off the used bit and proceeds to the next in-use frame. (First chance)
• if the algorithm finds (used bit, dirty bit) = (0,1), it saves the value of the dirty bit and changes the state to (0,0). (Second chance)
• in effect, it gives the dirty page a third chance.

Why does the algorithm favor leaving dirty pages in memory?

The second-chance algorithm given in Silberschatz & Galvin is a simpler version of the clock algorithm that doesn't use the dirty bit.

Page replacement vs. segment replacement: In segment replacement,
• To bring one segment in, more than one might need to be kicked out.
• The segments to be replaced must be ____.
• Segments are accumulated (via LRU approximation, etc.) until total (contiguous) memory is large enough to satisfy the request.
• Then swap out or release segments in order of descending size. (Not all may need to be released.)
Page replacement in DSM systems: As in any other system, when a DSM system needs to bring in a page, there may not be any free page frames. Thus, a page needs to be evicted from memory:
• Which page?
• Where should it be put?

Which page? A DSM system is different from other VM systems, in that
• Pages may be invalidated spontaneously due to actions of other processes.
• A clean page in the local memory may also be present in other memories.

When a page is evicted, the owner or page manager must be informed of this decision.

If no replicated page is a suitable victim, then a non-replicated page must be chosen, often by some LRU approximation.

Where to put it? There are two choices.
• Back on disk.
• It might be sent back to its home processor. But this might require the home processor to reserve a lot of memory for returned pages.

Another strategy would be to piggyback the number of free page frames on each message sent. Then a faulting processor could hand off a page to a processor with plenty of free memory.

Outline for Lecture 24
I. Locality of reference
   A. And program structure
II. User-level memory managers
III. The working-set model
   A. Preventing thrashing
   B. Working sets
   C. Properties of working sets
   D. Tracking working-set size: Maniac
   E. The WSClock algorithm
IV. Load control in Unix
V. Perf. of paging algs.

Locality of reference:
• Temporal locality. Once a location is referenced, it is often referenced again very soon.
• Spatial locality. Once a location is referenced, a nearby location is often referenced soon after.

Good, modular programs have high locality. See the graph of program locality in 9.7.1 (p. 319) of Silberschatz & Galvin.

Importance of program structure (S&G 9.8.4): Sometimes the order in which data items are referenced has a great effect on locality.

Consider this program fragment, and assume that pages are 128 words long.

var A: array [1 . . 128] of array [1 . . 128] of integer;
for j := 1 to 128 do
    for i := 1 to 128 do
        A[i, j] := 0;

Notice that the array is stored in row-major order:

A[1][1] A[1][2] ... A[1][128] A[2][1] ... A[2][128] ... A[128][128]

Thus, each row takes one page.

What happens if the system has at least 128 page frames to hold the array?

What happens if the system has fewer than 128 page frames for the array?

Changing the code can allow the program to execute efficiently with only a single page frame for the array:

var A: array [1 . . 128] of array [1 . . 128] of integer;
for i := 1 to 128 do
    for j := 1 to 128 do
        A[i, j] := 0;

For efficient access, the compiler can try
• not to split data structures across page boundaries, and
• to keep routines that call each other on the same page.

User-level memory managers: In the Mach operating system, most secondary-storage objects are mapped into the address space.

This means that files, databases, etc. are assigned virtual addresses. To reference them, the program simply generates their address; no special I/O calls need to be used.

The programmer may know something about the way that a particular object will be referenced. (S)He can then choose a page-replacement algorithm that works well with that structure.

For example, what would be a good page-replacement algorithm for a sequential file?

In Mach, the kernel doesn't decide what page-replacement algorithm is used. Rather, a separate page-replacement algorithm could be used for each memory object.

Mach provides a default memory manager, but the programmer does not have to use it for each memory object.

Thrashing and the working-set model: Most programs have a threshold beyond which
• incremental increases in memory bring relatively little advantage, but
• a small decrease in memory brings a drastic increase in page-fault rate.

A plot of the number of faults as a function of the number of pages is known as a Parachor curve.
[Figure: Parachor curve: number of page faults versus memory size.]

Informally, a process's working set is the set of pages that must be in memory for the process to execute efficiently.

If a process does not have "enough" pages, the page-fault rate is very high
→ low CPU utilization
→ the operating system thinks it needs to increase the degree of multiprogramming
→ another process is added to the system.

Thrashing means that the operating system is busy swapping pages in and out.

[Figure: CPU utilization versus degree of multiprogramming.]

Definition: The working set of a process W(t, Δ) at time t is the set of pages referenced by the process during the process-time interval (t - Δ, t). Δ is called the window size. (Note that the interval is measured in terms of process time, not real time.)

For example, if a process executes for Δ time units and uses only a single page, then | W(t, Δ) | = 1.

Note that the working-set size is a monotonic function of Δ.

A working set is a reliable estimate of a process's needs if it is large enough to contain pages being frequently accessed, and no others.
• Δ too small → will not encompass the working set.
• Δ too large → will encompass several localities.

The working-set principle says that a process should be eligible for execution iff its working set is resident in main memory. A page may not be removed from main memory if it is in the working set of a process.

Thrashing occurs when the working sets of the active processes cannot all fit in memory:

Σ size of working sets > # page frames

If this happens, what should be done?

If Σ size of working sets << number of page frames, ____.

The working set is a local policy that allows the number of page frames used by executing processes to vary.

When a process is suspended, the pages of its working set will eventually be swapped out. When it is reactivated, the working-set principle says the pages of its working set should be prepaged in. Why?

Properties of working sets: The size of a working set can vary.
If a process uses n pages, then

1 ≤ |W(t, Δ)| ≤ min(Δ, n).

So a fixed partition need not be allocated to a process.

Working sets are inclusive:

W(t, Δ) ⊆ W(t, Δ + 1)

Working sets tend to increase temporarily as a program moves from one phase of execution to another:

[Figure: working-set size versus time, alternating between transient and stable phases.]

Kahn showed that stable phases occupied about 98% of process time; nearly half the faults occurred during the other 2% of process time.

Keeping track of working-set size: The Maniac II implementation was due to James B. Morris. It uses
• a working-set register, that contains one bit for each page frame in main memory; only the bits for the page frames of the active process are set.
• a T register, that controls the frequency of updating the counters described below; and
• a set of page-frame registers that contain several fields:

[ D | A | clock ]

D is the dirty bit; A is the alarm bit; and clock is a four-bit counter.

The algorithm works like this:
• On each page reference, the alarm bit and counter are cleared. If the reference was a write, the dirty bit is set.
• Periodically, at an interval determined by the T register, the counters are incremented, but only counters for those pages whose working-set register bits are on.
• The alarm bit is set whenever the associated clock overflows. When the alarm bit is set, the page is no longer in the working set of its process, so it is eligible to be swapped out.

The T register's setting can be changed to "tune" the operation of the working-set replacement.

The WSClock algorithm: One problem with the Maniac II algorithm is the overhead of updating all the counters, especially when the number of page frames is large.

The WSClock algorithm approximates working-set replacement.

As hand passes over a page frame f, its used bit is tested and reset as in the clock algorithm.
If f's used bit was set, its time of last reference is assumed to be the current process time pt. This time is stored in the lf field of f's frame-table entry.

If the bit was not set, the page is replaceable iff

pt - lf[f] > T,

where T is the working-set parameter.

One variety of the WSClock algorithm uses a dirty bit as well as a used bit. When the hand passes a page whose dirty bit is set, that page is queued for "cleaning," that is, being written out to disk.

WSClock can be used for load control:
• When the hand goes all the way around the clock without encountering a replaceable frame, WSClock detects memory overcommitment.
• The page-cleaning queue is then examined.
  - If there is an outstanding request for a page to be cleaned, it is processed to yield a replaceable page.
  - If there is no request for a page to be cleaned, a process is suspended.

Load control in Unix (S&G, 21.6): Early (pre-3BSD) versions of Berkeley Unix employed process swapping; more recent versions use paging.

Swapping: In BSD systems that swap processes, system and user memory are kept in contiguous memory. This includes
• System memory: u area (like the PCB), and kernel stack.
• User memory: nonsharable code, data, and stack.

Why do you suppose this is done?

What problem does this exacerbate?

Why is sharable code handled separately?

Memory is allocated to processes using first-fit.

If a process gets too large to fit in its area,
• some systems look for contiguous memory at the end of currently allocated memory, or
• the whole process is moved to a different area.

Swapping decisions are made by the scheduler, or swapper.
• It wakes up at least once every four seconds.
• A process is more likely to be swapped out if it is idle, has been in main memory a long time, or ____.
• A process is more likely to be swapped back in if it ____.

Swap space in 4.3 BSD is allocated in pieces that are multiples of a power of 2 and a minimum size (e.g., 32 pages) up to a maximum determined by the size of the swap-space partition on the disk.

Paging: Unix uses demand paging, except that when a process is started, many of its pages are prepaged. Why is this done?

But these pages are marked "invalid," which means that they are awaiting swap-out. Why?

Pageouts are done in clusters, too.

On the VAX, I/O is done in 2-page chunks, or ____.

If a page is still in the page table, but is marked "invalid" pending swap-out, it can be marked valid and "reclaimed."

When paging is used, all code is by default shared and read only.

The VAX, the second architecture on which Unix was implemented, did not have a used bit! So, how could it use a paging algorithm that favored pages that were recently referenced?

The second-chance algorithm is used.
• A page is marked invalid when the hand sweeps past it.
• Upon reference to an invalid page, a page fault is generated.
• But the page-fault handler ____.

Later versions of BSD Unix use the clock algorithm.

The paging algorithm tries to keep "lots of" frames free. The lotsfree parameter is usually set at ...
