University of California, Berkeley
College of Engineering
Computer Science Division | EECS
Spring 1998, D.A. Patterson

Quiz 2 Solutions
CS252 Graduate Computer Architecture

Question 1: Bigger, Better, Faster?
A computer system has the following characteristics:

- Uses 10 GB disks that rotate at 10000 RPM, have a data transfer rate of 10 MByte/s (for each disk), and have an 8 ms seek time
- Has an average I/O size of 32 KByte
- Is limited only by the disks
- Has a total of 20 disks

Each disk can handle only one request at a time, but each disk in the system can be handling a different request. The data is not striped (all I/O for each request has to go to one disk).

(a) What is the average service time for a request?

$\text{service time} = \text{seek time} + \text{rotational latency} + \text{transfer time}$
$\text{seek time} = 8\ \text{ms}$
$\text{rotational latency} = \frac{1\ \text{min}}{10000\ \text{rotations}} \times \frac{60\ \text{sec}}{1\ \text{min}} \times \frac{1}{2}\ \text{rotation} = 3\ \text{ms}$
$\text{transfer time} = \frac{1\ \text{sec}}{10 \times 2^{20}\ \text{Bytes}} \times 32\ \text{kBytes} = 3.125\ \text{ms}$
$\text{service time} = 8\ \text{ms} + 3\ \text{ms} + 3.125\ \text{ms} = 14.125\ \text{ms}$

(b) Given the average I/O size from above and a random distribution of disk locations, what is the maximum number of I/Os per second (IOPS) for the system?

$\text{IOPS} = \frac{1}{\text{service time}} = \frac{1}{0.014125\ \text{sec}} \approx 71$

So, a single disk can support 71 IOPS. Therefore, the overall IOPS = 20 × 71 = 1420 IOPS.

Someone suggests improving the system by using new, better disks. For the same total price as the original disks, you can get 11 disks that have 19 GB each, rotate at 12000 RPM, transfer at 12 MB/s, and have a 6 ms seek time.

(c) What would be the average service time for a request in the new system?

$\text{service time} = \text{seek time} + \text{rotational latency} + \text{transfer time}$
$\text{seek time} = 6\ \text{ms}$
$\text{rotational latency} = \frac{1\ \text{min}}{12000\ \text{rotations}} \times \frac{60\ \text{sec}}{1\ \text{min}} \times \frac{1}{2}\ \text{rotation} = 2.5\ \text{ms}$
$\text{transfer time} = \frac{1\ \text{sec}}{12 \times 2^{20}\ \text{Bytes}} \times 32\ \text{kBytes} = 2.60\ \text{ms}$
$\text{service time} = 6\ \text{ms} + 2.5\ \text{ms} + 2.60\ \text{ms} = 11.10\ \text{ms}$
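The service-time arithmetic in parts (a) through (c) can be checked with a short script. This is only a sketch: the function name and parameters are illustrative and do not come from the quiz, and it assumes, as the worked solution does, that 1 MByte is 2^20 bytes and 1 KByte is 2^10 bytes.

```python
def service_time_s(seek_ms, rpm, transfer_mb_per_s, io_kb):
    """Average disk service time in seconds: seek + half a rotation + transfer.

    Hypothetical helper; MByte is taken as 2**20 bytes and KByte as 2**10
    bytes, matching the worked solution.
    """
    seek = seek_ms / 1000.0
    # On average the head waits half a revolution; one revolution takes 60/rpm s.
    rotational_latency = 0.5 * 60.0 / rpm
    transfer = (io_kb * 2**10) / (transfer_mb_per_s * 2**20)
    return seek + rotational_latency + transfer


old = service_time_s(seek_ms=8, rpm=10000, transfer_mb_per_s=10, io_kb=32)
new = service_time_s(seek_ms=6, rpm=12000, transfer_mb_per_s=12, io_kb=32)

print(f"old disk: {old * 1000:.3f} ms per request, {1 / old:.1f} IOPS per disk")
print(f"new disk: {new * 1000:.3f} ms per request")
```

The maximum IOPS of a system limited only by its disks is then the per-disk figure multiplied by the number of disks, since each disk serves one request at a time.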
(d) What is the maximum number of IOPS in the new system?

$\text{IOPS} = \frac{1}{\text{service time}} = \frac{1}{0.01110\ \text{sec}} \approx 90$

So, a single disk can support 90 IOPS. Therefore, the overall IOPS = 11 × 90 = 990 IOPS.

(e) Treat the entire system as an M/M/m queue (that is, a system with m servers rather than one), where each disk is a server. All requests are in a single queue. Requests may not overlap. Assume both systems receive an average of 950 I/O requests per second. Assume that any disk can service any request. What is the mean response time of the old system? The new one? You might find the following equation for an M/M/m queue useful:

$\text{Server utilization} = \frac{\text{Arrival rate}}{\frac{1}{\text{Time}_{server}} \times m} = \frac{\text{Arrival rate} \times \text{Time}_{server}}{m}$
$\text{Time}_{system} = \text{Time}_{server} \times \left(1 + \frac{\text{Server utilization}}{m \times (1 - \text{Server utilization})}\right)$

Old system:

$\text{utilization} = \frac{950 \times 0.014125}{20} = 0.6709$
$T_s = 0.014125 \times \left(1 + \frac{0.6709}{20 \times (1 - 0.6709)}\right) = 15.56\ \text{ms}$

New system:

$\text{utilization} = \frac{950 \times 0.01110}{11} = 0.9586$
$T_s = 0.01110 \times \left(1 + \frac{0.9586}{11 \times (1 - 0.9586)}\right) = 34.47\ \text{ms}$
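As a rough check of part (e), the two equations quoted in the question can be evaluated directly. The sketch below implements exactly that formula as given in the quiz (it is not a full Erlang-C M/M/m model), and the function name is an assumption made for illustration.

```python
def mmm_response_time(arrival_rate, service_time, servers):
    """Mean time in system using the M/M/m equation quoted in the quiz.

    Hypothetical helper. arrival_rate is in requests/s, service_time in s.
    """
    utilization = arrival_rate * service_time / servers
    if utilization >= 1.0:
        raise ValueError("system is saturated (utilization >= 1)")
    return service_time * (1 + utilization / (servers * (1 - utilization)))


old_ms = mmm_response_time(950, 0.014125, 20) * 1000
new_ms = mmm_response_time(950, 0.01110, 11) * 1000
print(f"old system: {old_ms:.2f} ms, new system: {new_ms:.2f} ms")
```

The new-system result differs from the hand calculation in the last digits because the worked solution rounds the utilization to four places before substituting it; near saturation the formula is very sensitive to that rounding.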
(f) Which system has a lower average response time? Why?

The system with 20 disks has a lower average response time. Even though each disk has worse performance, the larger number of disks means that the old system is capable of more IOPS, and hence has a lower utilization. Thus, the waiting time is much lower than on the new system.

Question 2: A MESI Situation
Figure 1 below shows the three-phase write-back cache coherence protocol from the book.
[Figure 1: Three-Phase Protocol. State diagram with three states: Invalid, Shared (read only), and Exclusive (read/write).]

The figure legend distinguishes three kinds of labels: CPU stimulus causing transition, operation on bus causing transition, and CPU action on bus. The following terminology is used:

Label   Stimulus or action
CRH     CPU read hit
CRM     CPU read miss
CWH     CPU write hit
CWM     CPU write miss
BRM     read miss for this block
BWM     write miss for this block
PRM     place CPU read miss on bus
PWM     place CPU write miss on bus
WB      write back cache block
Figure 2 below shows a write-back MESI (Modified, Exclusive, Shared, Invalid) protocol. Assume that the processor is able to detect whether a read miss is a shared read miss or an exclusive read miss.
[Figure 2: MESI Protocol. State diagram with four states: Invalid, Read Only (Shared), Read Only (unshared, or clean Exclusive), and Read/Write (dirty exclusive, or Modified).]

The figure legend distinguishes three kinds of labels: CPU stimulus causing transition, operation on bus causing transition, and CPU action on bus. The following terminology is used:

Label   Stimulus or action
CRH     CPU read hit
CRMs    CPU read miss shared
CRMx    CPU read miss exclusive
CWH     CPU write hit
CWM     CPU write miss
BRM     read miss for this block
BWM     write miss for this block
PRM     place CPU read miss on bus
PWM     place CPU write miss on bus
WB      write back cache block
Here is a sequence of memory accesses. Assume only 2 processors, with the value 5 stored in address A1. All cache locations start out in the invalid state.

  P1 reads A1
  P1 writes 10 to A1
  P2 reads A1
  P2 writes 15 to A1

Below are the actions that occur for the above sequence on a group of machines using the three-phase protocol. Mark in the table any of the actions that change when the machines use the MESI protocol. Only show the items that change. Extra blank lines have been provided for you to show your changes. There may be more blank lines than you need.

"Read" for bus state means that a processor is reading the value that is on the bus. In the table below, a bus action in one line affects processors and memory in the next line. For bus actions, denote a shared read miss as "RdMsS" and an exclusive read miss as "RdMsX". For states, use "Mod", "Excl", "Shar", or "Inv" to represent the read/write, read only (unshared), read only (shared), and invalid states. Use "None" to represent an item that exists in the three-phase protocol, but not in the MESI protocol.

Operation       Protocol  P1               P2               Bus           Memory
                          State Addr Val   State Addr Val   Action Proc   Addr Val
P1 Rd A1        3-phase   Shar  A1                          RdMs   P1     A1   5
                MESI      Excl  A1                          RdMsX  P1     A1   5
                3-phase   Shar  A1   5                      Read          A1   5
                MESI      Excl  A1   5                      Read          A1   5
P1 Wr 10 to A1  3-phase   Ex    A1   10                     WrMs   P1     A1   5
                MESI      Mod   A1   10                     None          A1   5
P2 Rd A1        3-phase   Ex    A1   10    Shar  A1         RdMs   P2     A1   5
                MESI      Mod   A1   10    Shar  A1         RdMsS  P2     A1   5
                3-phase   Shar  A1   10    Shar  A1         WrBk   P1     A1   5
                MESI      Shar  A1   10    Shar  A1         WrBk   P1     A1   5
                3-phase   Shar  A1   10    Shar  A1   10                  A1   10
                MESI      Shar  A1   10    Shar  A1   10                  A1   10
P2 Wr 15 to A1  3-phase   Shar  A1   10    Excl  A1   15    WrMs   P2     A1   10
                MESI      Shar  A1   10    Mod   A1   15    WrMs   P2     A1   10
                3-phase   Inv   A1         Excl  A1   15                  A1   10
                MESI      Inv   A1         Mod   A1   15                  A1   10

Question 3: Cluster vs SMP
Evaluate the resource utilization while performing streaming I/O on the following three architectures:

- A single workstation
- A cluster of workstations
- A symmetric multiprocessor (SMP)

The basis for the first two architectures is shown in Figure 3. The cluster is built of 8 copies of the single workstation and is shown in Figure 4.

The workstation contains a 167 MHz processor with 512 KB of L2 cache and 128 Mbyte of memory. The memory bus is 128 bits wide and operates at 83.3 MHz. The workstation contains one 32-bit, 25 MHz I/O bus called the S-Bus. Attached to this I/O bus are two fast-wide (16-bit, 10 MHz) SCSI controllers. In the cluster, a Myrinet network interface is also installed in each machine (Myrinet is a switch-based network that can support 1280 Mbit/s in each direction); the machines are all connected to a single eight-port switch.

[Figure 3: The Workstation. Block diagram: the processor and memory sit on the memory bus (128-bit, 83.3 MHz); an I/O chip connects the memory bus to the S-Bus (32-bit, 25 MHz), which carries SCSI #1 and SCSI #2 (16-bit, 10 MHz, each with a disk) and the Myrinet network interface (1280 Mbit/s) leading to the Myrinet network.]
This problem is based on a simplified version of the study "The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs" by Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson from the Fourth International Symposium on High-Performance Computer Architecture. The paper is available at http://www.cs.berkeley.edu/~remzi/Postscript/hpca98.ps.gz
[Figure 4: The Cluster. Eight workstations connected to a single Myrinet switch.]

The SMP is shown in Figure 5. The system consists of four CPU/Memory boards and four S-Bus I/O boards connected via the GigaPlane memory bus. The GigaPlane is a 256-bit wide 83.3 MHz bus. Each CPU/Memory board contains two 167 MHz processors (each with 512 KB of L2 cache) and 256 Mbyte of memory. Each I/O board contains two S-Busses. Each S-Bus has one fast-wide (16-bit, 10 MHz) SCSI controller. All communication is performed via loads and stores to shared memory. All memory accesses are uniform access time.

[Figure 5: The SMP. Block diagram: four CPU/Memory boards (two processors and memory each) and four S-Bus I/O cards (two I/O chips each, driving S-Bus #1 with SCSI #1 and S-Bus #2 with SCSI #2; S-Bus 32-bit, 25 MHz; SCSI 16-bit, 10 MHz), all attached to the GigaPlane (256-bit, 83.3 MHz).]
The streaming I/O benchmark we will use is a sorting benchmark. The benchmark processes 100-byte records that include 10-byte keys. The basic algorithm is the same on all three platforms.

In the first step, the records must be converted from the layout on disk to a format more suitable for efficient sorting. As records are read from disk, the key (which is part of the record) and a pointer to the full record are placed into buckets based on the top few bits of the key; this improves the cache behavior of the sort in two ways. First, the sort operates on only (partial key, pointer) pairs, thus copying only 8 bytes rather than 100-byte records as keys are compared and swapped. Second, the number of keys in each bucket matches the size of the second-level cache. The next step sorts the keys in each bucket. Assume that the data is initially randomly placed over all disks.

The basic algorithm has been slightly tailored for best performance on each platform. Figures 6, 7, and 8 show a graphical representation of the read phase for each platform. The arrows show the order and direction of data that moves across busses, but do not show the relative sizes of each transfer. The following paragraphs refer to the numbers in those figures.
[Figure 6: Workstation Sort Read Phase. Numbered transfers 1 through 5 among Disk/Net, Memory, and Processor.]

In the workstation read phase, the input file is read into the user's address space (1). These records are then copied to an input buffer (2, 3). Each key is examined (4), and a (partial key, pointer) pair is written into the correct bucket (5).

[Figure 7: Cluster Sort Read Phase. Numbered transfers 1 through 9 among Disk/Net, Memory, and Processor.]

In the cluster read phase, the input file is read into the user's address space (1). Records are then copied into one of 8 send buffers (2, 3); as each buffer fills, it is sent to the appropriate destination processor (4). Upon receipt of records from other processors (5), records are copied into a record buffer (6, 7). Then, each key is examined, and a (partial key, pointer) pair is written into the bucket array (8, 9), as in the single workstation sort.
[Figure 8: SMP Sort Read Phase. Numbered transfers 1 through 9 among Disk/Net, Memory, and Processor.]

In the SMP read phase, the input file is read into the user's address space (1). Records are then copied into an input buffer (2, 3). Each key is examined (4), and a (partial key, pointer) pair is written into the correct bucket buffer (5). When a bucket buffer fills, the processor copies the (partial key, pointer) pairs (6, 7) and records (8, 9) into a global array.

- The GigaPlane bus can sustain 94% of its theoretical maximum transfer rate.
- The SCSI bus can sustain 80% of its theoretical maximum transfer rate.
- The workstation and cluster memory bus can sustain 75% of its theoretical maximum transfer rate.
- The S-Bus can sustain 55% of its theoretical maximum transfer rate.

The table below shows the number of millions of instructions required to process each megabyte of data on the disk for the different platforms. The differences are mainly from overhead of sending and receiving network messages, and from slightly different ways of zeroing pages on the different platforms.

              Workstation   Cluster   SMP
Read Phase    4.1           5.5       4.6

The table below shows the measured CPI for each platform while running the benchmark.

              Workstation   Cluster   SMP
Read Phase    2.0           2.2       2.2
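The two tables combine with the 167 MHz clock into a per-processor processing rate: cycles per megabyte is (millions of instructions per megabyte) × CPI, so the rate in MB/s is clock / (MI per MB × CPI). The sketch below works this out; the dictionary layout and names are illustrative assumptions, and only the numbers come from the tables above.

```python
CLOCK_MHZ = 167.0  # processor clock in millions of cycles per second

# (millions of instructions per MB, CPI) for the read phase, from the tables above
READ_PHASE = {
    "workstation": (4.1, 2.0),
    "cluster": (5.5, 2.2),
    "smp": (4.6, 2.2),
}


def cpu_rate_mb_per_s(mi_per_mb, cpi, clock_mhz=CLOCK_MHZ):
    """MB/s one processor can handle if it is the only bottleneck."""
    mcycles_per_mb = mi_per_mb * cpi  # millions of cycles needed per MB
    return clock_mhz / mcycles_per_mb


rates = {name: cpu_rate_mb_per_s(mi, cpi) for name, (mi, cpi) in READ_PHASE.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} MB/s per processor")
```

Dividing a processor's actual disk rate by this figure gives its CPU utilization.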
(a) Determine how much of each resource (I/O bus and memory bus) is used during the read phase of the sort for each platform. First, write a general equation for how much of each resource is used in terms of the rate data is read from disk ($r_D$), the number of processors in the cluster or SMP ($P$), and the sizes of the records ($rec$), keys ($key$), and (partial key, pointer) pairs ($bucket$). ($r_D$ is the total rate that data is read from disks, that is, the sum of all the individual disk rates.) Give the combined bandwidth required for all the busses. Then, fill in the table below with the summary. Provide a short justification for these equations. The resource usage for the workstation sort has been completed as an example.
Workstation

Memory Bus: During the read phase, data is read from disk ($r_D$), then copied into memory ($2 r_D$). The keys are read ($\frac{key}{rec} r_D$), and (partial key, pointer) pairs written to the right bucket ($\frac{bucket}{rec} r_D$).

I/O Bus: Data is read from the disk ($r_D$).

Cluster

Memory Bus: During the read phase, data is read from disk ($r_D$), then copied into buffers ($2 r_D$). Then, blocks are sent to other processors ($\frac{P-1}{P} r_D$), and received from other processors ($\frac{P-1}{P} r_D$), then copied into buffers ($2 r_D$). After this, keys are read ($\frac{key}{rec} r_D$), and (partial key, pointer) pairs are written ($\frac{bucket}{rec} r_D$).

I/O Bus: Data is read from the disk ($r_D$) and blocks are sent to and received from other processors ($2 \frac{P-1}{P} r_D$).

SMP

Memory Bus: During the read phase, data is read from disk ($r_D$), then copied into buffers ($2 r_D$). Then, each key is examined ($\frac{key}{rec} r_D$), and (partial key, pointer) pairs are written ($\frac{bucket}{rec} r_D$). Once the buckets fill, (partial key, pointer) pairs are copied ($2 \frac{bucket}{rec} r_D$) and records are copied ($2 r_D$) into a global array.

I/O Bus: Data is read from the disk ($r_D$).
Resource                  Usage
Workstation Memory Bus    $r_D + 2 r_D + \frac{key}{rec} r_D + \frac{bucket}{rec} r_D$
Workstation I/O Bus       $r_D$
Cluster Memory Bus        $r_D + 2 r_D + 2 \frac{P-1}{P} r_D + 2 r_D + \frac{key}{rec} r_D + \frac{bucket}{rec} r_D$
Cluster I/O Bus           $r_D + 2 \frac{P-1}{P} r_D$
SMP Memory Bus            $r_D + 4 r_D + \frac{key}{rec} r_D + 3 \frac{bucket}{rec} r_D$
SMP I/O Bus               $r_D$

(b) Fill in the values of the general equations from part (a), using the following values: 8 processors in the cluster and SMP ($P$), 100-byte records ($rec$), 10-byte keys ($key$), and 8-byte (partial key, pointer) pairs ($bucket$). Leave the term $r_D$ in your equations. The read phase of the workstation sort has been completed as an example.
                    Workstation   Cluster       SMP
Memory Bus Usage    $3.18 r_D$    $6.93 r_D$    $5.34 r_D$
I/O Bus Usage       $r_D$         $2.75 r_D$    $r_D$
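The coefficients in the table follow from substituting the part (b) values into the part (a) equations. A minimal sketch of that substitution is shown below; the function name and the returned dictionary layout are assumptions made for illustration, and each value is the multiple of $r_D$ that crosses the named bus.

```python
def bus_coefficients(P=8, rec=100, key=10, bucket=8):
    """Read-phase bus traffic as multiples of r_D (hypothetical helper)."""
    k = key / rec        # fraction of each record occupied by the key
    b = bucket / rec     # size of a (partial key, pointer) pair relative to a record
    net = (P - 1) / P    # fraction of records destined for another processor
    return {
        "workstation": {"memory": 1 + 2 + k + b, "io": 1.0},
        # disk read, copy, send + receive, copy, key read, bucket write
        "cluster": {"memory": 1 + 2 + 2 * net + 2 + k + b, "io": 1 + 2 * net},
        # disk read, two record copies (2 r_D each), key read, three bucket transfers
        "smp": {"memory": 1 + 4 + k + 3 * b, "io": 1.0},
    }


coeffs = bus_coefficients()
for platform, usage in coeffs.items():
    print(f"{platform}: memory {usage['memory']:.2f} r_D, I/O {usage['io']:.2f} r_D")
```

Changing `P` shows how the cluster's network term approaches $2 r_D$ as processors are added, while the workstation and SMP coefficients do not depend on `P`.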
(c) Each disk can read data at 5.5 Mbyte/s. Assume disks are organized the most efficient way possible (the disks are equally spread over all the busses available). If we use 2 disks per processor, what is the utilization of each resource (SCSI bus, I/O bus, memory bus, processor) during the read phase of the sort for only the cluster and SMP platforms? Determine utilization as a percent of maximum sustainable transfer rate for each bus.

SCSI Bus

Cluster: 2 disks per processor, 2 SCSI busses per processor, therefore one disk per SCSI bus.

$\frac{5.5\ \text{MB/s}}{0.8 \times 2\ \text{Bytes} \times 10\ \text{MHz}} = 34.38\%$

SMP: 2 disks per processor, 1 SCSI bus per processor, therefore two disks per SCSI bus.

$\frac{2 \times 5.5\ \text{MB/s}}{0.8 \times 2\ \text{Bytes} \times 10\ \text{MHz}} = 68.75\%$

I/O Bus

Cluster: The cluster bandwidth required on the I/O bus is 2.75 times the bandwidth read from disk.

$\frac{2.75 \times 2 \times 5.5\ \text{MB/s}}{0.55 \times 4\ \text{Bytes} \times 25\ \text{MHz}} = 55\%$

SMP: The SMP bandwidth required on the I/O bus is the same as the bandwidth read from disk.

$\frac{1 \times 2 \times 5.5\ \text{MB/s}}{0.55 \times 4\ \text{Bytes} \times 25\ \text{MHz}} = 20\%$

Memory Bus

Cluster: The cluster bandwidth required on the memory bus is 6.93 times the bandwidth read from disk.

$\frac{6.93 \times 2 \times 5.5\ \text{MB/s}}{0.75 \times 16\ \text{Bytes} \times 83.3\ \text{MHz}} = 7.6\%$

SMP: The SMP bandwidth required on the memory bus is 5.34 times the bandwidth read from disk. Since this is a shared memory bus, the total bandwidth required will be 8 times greater since we have 8 processors.

$\frac{8 \times 5.34 \times 2 \times 5.5\ \text{MB/s}}{0.94 \times 32\ \text{Bytes} \times 83.3\ \text{MHz}} = 18.75\%$

CPU

Cluster: The cluster requires 5.5 million instructions per megabyte of data. The CPI during the read phase is 2.2.

$\frac{5.5 \times 2.2 \times 2 \times 5.5\ \text{MB/s}}{167\ \text{MHz}} = 79.70\%$

SMP: The SMP requires 4.6 million instructions per megabyte of data. The CPI during the read phase is 2.2.

$\frac{4.6 \times 2.2 \times 2 \times 5.5\ \text{MB/s}}{167\ \text{MHz}} = 66.66\%$
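The part (c) figures can be reproduced in a few lines. This is a sketch under the same assumptions as the worked solution (bus capacity = sustained fraction × width in bytes × clock in MHz, giving MB/s; 2 disks per processor at 5.5 MB/s each); all variable names are illustrative.

```python
DISK_MB_S = 5.5      # per-disk read rate
DISKS_PER_CPU = 2
P = 8                # processors sharing the SMP GigaPlane

r_d = DISKS_PER_CPU * DISK_MB_S   # disk rate seen by one processor, MB/s

# Sustainable capacities in MB/s: fraction * bytes per transfer * MHz
scsi_cap = 0.80 * 2 * 10
sbus_cap = 0.55 * 4 * 25
membus_cap = 0.75 * 16 * 83.3     # workstation/cluster memory bus
giga_cap = 0.94 * 32 * 83.3       # SMP GigaPlane


def pct(demand, capacity):
    """Utilization as a percentage of sustainable capacity."""
    return 100.0 * demand / capacity


cluster = {
    "scsi": pct(1 * DISK_MB_S, scsi_cap),      # one disk per SCSI bus
    "io": pct(2.75 * r_d, sbus_cap),
    "memory": pct(6.93 * r_d, membus_cap),
    "cpu": pct(5.5 * 2.2 * r_d, 167),          # MI/MB * CPI * MB/s over MHz
}
smp = {
    "scsi": pct(2 * DISK_MB_S, scsi_cap),      # two disks per SCSI bus
    "io": pct(1.0 * r_d, sbus_cap),
    "memory": pct(P * 5.34 * r_d, giga_cap),   # all processors share one bus
    "cpu": pct(4.6 * 2.2 * r_d, 167),
}

for name, util in (("cluster", cluster), ("smp", smp)):
    print(name, {k: round(v, 2) for k, v in util.items()})
```

In both systems the processor is the most heavily used resource, followed by the SCSI bus on the SMP and the I/O bus on the cluster.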
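(e) Explain briefly which system scales the best in terms of adding more disks for this benchmark.

The SMP can add another disk at full bandwidth for this benchmark while the cluster cannot, because of the CPU utilization. With two disks per processor the cluster CPU is already 79.70% utilized, so a third disk per processor would require roughly 120% of the CPU, whereas the SMP CPU at 66.66% would reach roughly 100%.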