Utilization and Predictability in Scheduling the IBM SP2 with Backfilling

Dror G. Feitelson and Ahuva Mu'alem Weil
Institute of Computer Science
The Hebrew University of Jerusalem
91904 Jerusalem, Israel
{feit,ahumu}@cs.huji.ac.il

Abstract

Scheduling jobs on the IBM SP2 system is usually done by giving each job a partition of the machine for its exclusive use. Allocating such partitions in the order that the jobs arrive (FCFS scheduling) is fair and predictable, but suffers from severe fragmentation, leading to low utilization. This motivated Argonne National Lab, where the first large SP1 was installed, to develop the EASY scheduler. This scheduler, which has since been adopted by many other SP2 sites, uses aggressive backfilling: small jobs are moved ahead to fill in holes in the schedule, provided they do not delay the first job in the queue. We show that a more conservative approach, in which small jobs move ahead only if they do not delay any job in the queue, produces essentially the same benefits in terms of utilization. Our conservative scheme has the added advantage that queueing times can be predicted in advance, whereas in EASY the queueing time is unbounded.

1 Introduction

The scheduling scheme used on most distributed-memory parallel supercomputers is variable partitioning, meaning that each job receives a partition of the machine with its desired number of processors [2]. Such partitions are allocated in a first-come first-serve (FCFS) manner to interactive jobs that are submitted directly, and to batch jobs that are submitted via a queueing system such as NQS. But this approach suffers from fragmentation, where the processors cannot meet the requirements of the next queued job and therefore remain idle. As a result, system utilization is typically in the range of 50-80% [12, 9, 4, 7].

It is well known that the best solutions for this problem are to use dynamic partitioning [11] or gang scheduling [3]. However, these schemes have practical limitations. The only efficient and widely used implementation of gang scheduling was the one on the CM-5 Connection Machine; other implementations are too coarse-grained for real interactive support, and do not enjoy much use. Dynamic partitioning has not been implemented on production machines at all.

A simpler approach is just to re-order the jobs in the queue, that is, to use non-FCFS policies [5]. Consider the following scenario: a number of jobs are running side by side, and the next queued job requires all the processors in the system. An FCFS scheduler would reserve all the processors that are freed for this queued job, and leave them idle. A non-FCFS scheduler would instead schedule some other, smaller jobs that are behind the big job in the queue, rather than letting the processors idle [8, 1]. Of course, this runs the danger of starving the large job, as small jobs continue to pass it by. The typical solution to this problem is to allow only a limited number of jobs to leapfrog a job that cannot be serviced, and then start to reserve and idle the processors. The point at which the policies are switched can be chosen so as to amortize the idleness over more useful computation, by making jobs that create significant idleness wait longer before making a reservation.

A somewhat more sophisticated policy is to require users to estimate the runtime of their jobs. Using this information, only short jobs, those that are expected to terminate in time, are allowed to leapfrog a waiting large job.
This approach, which is called backfilling, was developed for the IBM SP1 parallel supercomputer installed at Argonne National Lab as part of EASY (the Extensible Argonne Scheduling sYstem) [10], which has since been integrated with the LoadLeveler scheduler from IBM for the SP2 [13]. The EASY backfilling algorithm only checks that jobs that move ahead in the queue do not delay the first queued job. We show that this approach can lead to unbounded queueing delays for other queued jobs, and therefore prevents the system from making definite predictions as to when each job will run. We then go on to show that an alternative approach, in which short jobs are moved ahead only if they do not delay any job in the queue, has essentially the same benefits as the more aggressive EASY algorithm. As this approach has the additional benefit of making an exact reservation for each job immediately when it is submitted, it is preferable to the EASY algorithm. The comparison of the algorithms is done both with a general workload model and with specific workload traces from SP2 installations.

2 Backfilling

Backfilling is an optimization in the framework of variable partitioning. In this framework, users specify the number of processors required for each job and also provide an estimate of the runtime; thus jobs can be described as requiring a rectangle in processor/time space (Fig. 1). The jobs then run on dedicated partitions of the requested size. Note that users are motivated to provide an accurate estimate of the runtime: lower estimates mean that the job may be able to run sooner, but if the estimate is too low the job will be killed when it overruns its allocation.

[Figure 1: Graphical representation of a job in processor/time space (axes: processors vs. runtime).]

Once runtime estimates are available, it is possible to predict when jobs will terminate, and thus when the next queued jobs will be able to run. With FCFS scheduling, queueing time is estimated based on previous jobs in the queue. However, FCFS suffers from fragmentation and delays short jobs that are stuck behind long ones. Backfilling improves upon this by moving short jobs ahead in the queue to utilize "holes" in the schedule. The name "backfilling" was coined by Lifka to describe the EASY scheduler for the Argonne SP1 [10], although the concept was also present in earlier systems (e.g. [8]).

It is desirable that a scheduler with backfilling support two conflicting goals: on one hand, to move as many short jobs forward as possible, in order to improve utilization and responsiveness; on the other hand, to avoid starvation of large jobs, and in particular, to be able to predict when each job will run. Different versions of backfilling balance these goals in different ways.

2.1 Conservative Backfilling

Conservative backfilling is the vanilla version usually assumed in the literature (e.g. [6, 3]), although it seems not to be used in practice. In this version, backfilling is done subject to checking that it does not delay any previous job in the queue. We call this version "conservative" backfilling to distinguish it from the more aggressive version used by EASY, described below. Its advantage is that it allows scheduling decisions to be made upon job submittal, and thus has the capability of predicting when each job will run and giving users execution guarantees. Users can then plan ahead based on these guaranteed response times. Obviously there is no danger of starvation, as a reservation is made for each job when it is submitted.
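As a concrete illustration of the rectangle abstraction (Fig. 1) and of the overrun rule described above, here is a minimal Python sketch; the names are ours, not taken from any actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Job:
    """A job request: a rectangle in processor/time space (Fig. 1)."""
    procs: int       # width: number of dedicated processors requested
    estimate: float  # height: user-supplied runtime estimate

def should_kill(job: Job, elapsed: float) -> bool:
    """A job that overruns its estimate is killed, so a low estimate
    may let a job start sooner but risks losing it before it ends."""
    return elapsed > job.estimate
```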
It is easiest to describe the algorithm that decides whether a given job can be used for backfilling as if it starts from scratch at each scheduling operation, with no information about prior commitments (Fig. 2). The algorithm creates a profile of processor usage at future times, kept as a linked list. Initially this is a monotonically decreasing profile, based on the expected terminations of the currently running jobs (top of Fig. 3). The queued jobs are then checked in order of arrival, to see if they can backfill and start execution immediately. However, jobs that cannot start immediately cannot be ignored. Rather, the profile is scanned to find when enough processors will be available for each queued job to start; this point in time is called the anchor point for that job. Scanning then continues to verify that the required processors remain available until the job terminates. If so, the job is assigned to this anchor point, and the profile is updated to reflect the processors allocated to it.

Input:
  - list of queued jobs, with node and time requirements
  - list of running jobs, with node usage and expected termination times
  - number of free nodes

Algorithm (conservative backfill, from scratch):
  1. Generate the processor usage profile of the running jobs:
     (a) sort the list of running jobs according to their expected termination times;
     (b) loop over the list, dividing the future into periods according to job terminations, and record the number of processors used in each period; this is the usage profile.
  2. Try to backfill with the queued jobs:
     (a) loop over the list of queued jobs in order of arrival;
     (b) for each one, scan the profile and find the first point where enough processors are available to run it; this is called its anchor point:
         i. starting from this point, continue scanning the profile to ascertain that the processors remain available until the job's expected termination;
         ii. if so, update the profile to reflect the allocation of processors to this job;
         iii. if not, continue the scan to find the next possible anchor point, and repeat the check;
     (c) the first job found that can start immediately is used for backfilling.

Figure 2: The conservative backfilling algorithm, when run from scratch, disregarding previous execution guarantees.

An example is given in Fig. 3. The first job in the queue does not have enough free processors to run, so a reservation is made for it after the first two running jobs terminate. The second queued job has a potential anchor point after only one job terminates, but using it would delay the first queued job; therefore its second anchor point is preferred. Adding job reservations to the profile is thus the mechanism that guarantees that future arrivals do not delay previously queued jobs. The third job can be scheduled immediately, so it is used for backfilling.

It is most convenient to maintain the profile as a linked list, because it may be necessary to split an item in two when a newly scheduled job is expected to terminate in the middle of a given period. In addition, an item may have to be added at the end of the profile whenever a job extends beyond its current end.
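The following Python sketch illustrates the from-scratch algorithm of Fig. 2. It is our own rendition, not the actual EASY or LoadLeveler code, and it simplifies in two ways: jobs are plain (processors, estimated runtime) tuples and running jobs are (processors, expected termination) tuples; and the profile records free rather than used processors, in a sorted list rather than a linked list.

```python
def build_profile(running, total_procs, now):
    """Step 1: build the profile from the running jobs.  The profile is
    a time-sorted list of (start_time, free_procs) periods; the last
    period extends forever (all running jobs have terminated)."""
    free = total_procs - sum(procs for procs, _ in running)
    profile = [(now, free)]
    for procs, end in sorted(running, key=lambda r: r[1]):
        free += procs
        if profile[-1][0] == end:
            profile[-1] = (end, free)   # merge simultaneous terminations
        else:
            profile.append((end, free))
    return profile

def find_anchor(profile, job):
    """Step 2b: the earliest profile point where the job both fits and
    keeps enough processors until its expected termination."""
    procs, est = job
    for i, (start, free) in enumerate(profile):
        if free < procs:
            continue                    # not enough processors here
        end = start + est
        if all(f >= procs for t, f in profile[i:] if t < end):
            return start                # anchor found
    # never reached when procs <= total_procs: the last period is all-free

def free_at(profile, t):
    """Free processors during the period containing time t."""
    free = profile[0][1]
    for s, f in profile:
        if s <= t:
            free = f
    return free

def reserve(profile, job, start):
    """Step 2b.ii: allocate the job's processors from its anchor to its
    expected termination, splitting a period at the end time if needed
    (the anchor itself is always an existing period boundary)."""
    procs, est = job
    end = start + est
    if all(s != end for s, _ in profile):
        profile = sorted(profile + [(end, free_at(profile, end))])
    return [(s, f - procs) if start <= s < end else (s, f)
            for s, f in profile]

def conservative_backfill(queued, running, total_procs, now):
    """Give every queued job, in order of arrival, a guaranteed start
    time; jobs anchored at `now` are the ones that backfill."""
    profile = build_profile(running, total_procs, now)
    starts = []
    for job in queued:
        anchor = find_anchor(profile, job)
        profile = reserve(profile, job, anchor)
        starts.append(anchor)
    return starts

# A 10-processor machine runs a 4-processor job until t=5 and another
# until t=8; an 8-processor job waits, followed by a 2-processor job.
# The small job is anchored at t=0, i.e. it backfills immediately.
print(conservative_backfill(
    queued=[(8, 4.0), (2, 2.0)],
    running=[(4, 5.0), (4, 8.0)],
    total_procs=10, now=0.0))           # -> [8.0, 0.0]
```

As in the text, each job adds at most one boundary to the profile, and the scan in find_anchor is repeated for every queued job, which is where the quadratic complexity noted next comes from.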
The length of the profile is therefore proportional to the number of jobs in the system, both queued and running, because each job adds at most one item to the profile. As the profile is scanned once for each queued job, the complexity of the algorithm is quadratic in the number of jobs.

[Figure 3: Example of conservative backfilling, showing the usage profile of the running jobs, the anchor points considered for the first and second queued jobs, and the third queued job backfilling immediately.]

The above algorithm leaves one question unanswered. Jobs are assigned a start time when they are submitted, based on the current usage profile, but they may actually be able to run sooner, because previous jobs terminated earlier than expected. The question is what to do when this happens. The options are to do nothing, and allow future arrivals to use the idle processors via backfilling, or to use the early termination to increase the flexibility of the scheduling, in one of two ways:

  - initiate a new round of backfilling when these resources become available; this can move small jobs way ahead of their originally assigned start times;
  - retain the original schedule, but compress it; this stays closest to the start times decided when the jobs were submitted, so it may be the most convenient for users.

The first of these two, re-scheduling all the jobs, sounds very promising, but turns out to violate the execution guarantees made by conservative backfilling. The guarantee is embodied in the system's prediction of when each job will run: as each job is submitted, the system scans the usage profile, finds the earliest time that the new job can run without delaying any previous job, and guarantees that the job will start at this time or earlier. In some cases this guaranteed time will be the result of backfilling with this job. If a new round of backfilling is done later, with different data about job runtimes due to an early termination, the same job may not be backfilled, and will therefore run much later than the guaranteed time. An example is given in Fig. 4: according to the original schedule, the second queued job can backfill and start at time T1, but after the bottom running job terminates much earlier than expected, the first queued job can start earlier too, leaving no space for backfilling. The second queued job then has to start at the later time T3.

[Figure 4: Repeated backfilling after a running job terminates earlier than expected may cause a job that was expected to backfill to actually run later than the original prediction. It is therefore better to just compress the original schedule.]

The preferred choice is therefore compression, meaning that the original schedule is retained, but each job is moved forward as much as possible. This can be done in either of two ways. In the first, the profile is re-generated from scratch, but the jobs are considered in the order they appear in the original schedule, rather than in the order of arrival. Returning to the example in Fig. 4, the second queued job stays in front and moves up from T1 to the time of the early termination, while the first queued job moves up from T2 to T4. The second way is to retain the profile and update it one job at a time.
For each job, we remove it from the profile and then re-insert it at the earliest possible time. This approach has two advantages. First, the jobs can be considered in order of arrival, so jobs that have been waiting longer get a better chance to move forward. Second, jobs provably do not get delayed, because at worst each job is re-inserted in the same position it held previously.

The use of compression has another implication: as the schedule is maintained and is not changed by future events, it also makes sense to maintain the usage profile continuously. As jobs arrive and terminate, the profile is updated rather than being re-generated from scratch each time. Thus the algorithm in Fig. 2 is replaced by two separate procedures:

  - Upon arrival, the first possible starting time for the new job is found based on the current profile, and the profile is updated. This is just the inner loop of the original algorithm.
  - Upon termination, the profile is scanned and the schedule is compressed.

The complexity of the insertion procedure is only linear in the number of jobs, rather than quadratic. The complexity of compression remains quadratic, because the profile is scanned again for each job.
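Continuing the same sketch (and reusing its hypothetical find_anchor and reserve helpers), compression after an early termination might look as follows; the profile passed in is assumed to have already been updated to reflect the extra processors freed by the early termination:

```python
def unreserve(profile, job, start):
    """Inverse of reserve(): return the job's processors to every
    period in [start, start + estimate)."""
    procs, est = job
    end = start + est
    return [(s, f + procs) if start <= s < end else (s, f)
            for s, f in profile]

def compress(schedule, profile):
    """Compress the schedule: handle jobs in order of arrival, removing
    each from the profile and re-inserting it at the earliest possible
    time.  At worst a job falls back into its own, just-vacated slot,
    so no job is ever delayed relative to its guaranteed start."""
    new_schedule = []
    for job, start in schedule:          # schedule: list of (job, start)
        profile = unreserve(profile, job, start)
        new_start = find_anchor(profile, job)
        profile = reserve(profile, job, new_start)
        new_schedule.append((job, new_start))
    return new_schedule, profile
```

Each re-insertion scans the profile once, matching the linear insertion cost and quadratic compression cost stated above.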
2.2 EASY Backfilling

Conservative backfilling moves jobs forward only if they do not delay any previously queued job. EASY backfilling takes a more aggressive approach, and allows short jobs to skip ahead provided they do not delay the job at the head of the queue [10]. Interaction with other jobs is not checked, and they may be delayed, as shown below. The objective is to improve the current utilization as much as possible, subject to some consideration of queue order. The price is that execution guarantees cannot be made, because it is impossible to predict how much each job will be delayed in the queue. Thus the algorithm is actually not as deterministic as stated in its documentation.

The algorithm is shown schematically in Fig. 5. It is executed when the first job in the queue cannot start, and identifies a job that can backfill, if one exists. Such a job must require no more than the currently available processors, and in addition must satisfy either of two conditions that guarantee it will not delay the first job in the queue (Fig. 6): either it will terminate before the time when the first job is expected to commence (the "shadow" time), or it will use only nodes that are left over after the first job has been allocated its nodes (the "extra" nodes).

Input:
  - list of queued jobs, with node and time requirements
  - list of running jobs, with node usage and expected termination times
  - number of free nodes

Algorithm (EASY backfill):
  1. Find the shadow time and the extra nodes:
     (a) sort the list of running jobs according to their expected termination times;
     (b) loop over the list, collecting nodes, until the number of available nodes is sufficient for the first job in the queue;
     (c) the time at which this happens is the shadow time;
     (d) if at this time more nodes are available than needed by the first queued job, the ones left over are the extra nodes.
  2. Find a backfill job:
     (a) loop over the list of queued jobs in order of arrival;
     (b) for each one, check whether either of the following conditions holds:
         i. it requires no more than the currently free nodes, and will terminate by the shadow time, or
         ii. it requires no more than the minimum of the currently free nodes and the extra nodes;
     (c) the first such job can be used for backfilling.

Figure 5: The EASY backfilling algorithm.

[Figure 6: The two conditions for backfilling in the EASY algorithm: terminating by the shadow time, or using only the extra nodes.]

This algorithm has two properties that together create an interesting combination.

Property 1: Queued jobs may suffer an unbounded delay.

[Figure 7: In EASY, backfilling may delay queued jobs (panels contrast FCFS with EASY, showing the added delay).]

Proof sketch: If a job is not the first in the queue, new jobs that arrive later may skip it in the queue. While such jobs are guaranteed not to delay the first job in the queue, they may indeed delay all other jobs. This is the reason that the system cannot predict when a queued job will eventually run. An example is shown in Fig. 7: the b...
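For contrast with the conservative sketch, here is an equally simplified Python rendition of one EASY scheduling step, following Fig. 5; again the names and job representation are ours, not the actual EASY code:

```python
def easy_backfill_step(queued, running, free_now, now):
    """One EASY step, run only when the first queued job cannot start:
    return a job that may backfill now, or None.  Queued jobs are
    (procs, estimate) tuples; running jobs are (procs, end) tuples."""
    first_procs, _ = queued[0]
    # 1. find the shadow time and the extra nodes: release the running
    #    jobs' nodes in order of expected termination until the first
    #    queued job fits
    avail = free_now
    for procs, end in sorted(running, key=lambda r: r[1]):
        avail += procs
        if avail >= first_procs:
            shadow = end                  # when the first job can start
            extra = avail - first_procs   # nodes it will leave unused
            break
    else:
        raise ValueError("first job requests more nodes than exist")
    # 2. the first queued job that fits in the free nodes now, and
    #    either terminates by the shadow time or uses only the extra
    #    nodes, is the backfill job
    for procs, est in queued[1:]:
        if procs <= free_now and (now + est <= shadow or procs <= extra):
            return (procs, est)
    return None
```

Note that, unlike the conservative sketch, no reservation is recorded for any job other than the first; this is precisely why Property 1 holds: later arrivals that satisfy either condition can keep overtaking the other queued jobs indefinitely.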