COS 226, Chapter 7: Spin Locks and Contention

Acknowledgement: some of these slides are taken from the companion slides for "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit.

Focus so far: correctness and progress. Models: accurate, but idealized. Protocols: elegant and important, but naive. New focus: performance. Models: more complicated, but still focused on principles. Protocols: elegant, important, and now realistic.

Kinds of Architectures
- SISD (uniprocessor): single instruction stream, single data stream.
- SIMD (vector): single instruction, multiple data.
- MIMD (multiprocessor): multiple instruction, multiple data. This is our space.

MIMD Architectures (shared-bus or distributed) raise several concurrency issues:
- Memory contention: not all processors can access the same memory at the same time; if they try, they have to queue.
- Contention for the communication medium: if everyone wants to communicate at the same time, some of them will have to wait.
- Communication latency: it takes time for a processor to communicate with memory or with another processor.

New goals: think about performance, not just correctness and progress; understand the underlying architecture; and understand how the architecture affects performance.

Start with mutual exclusion: what should you do if you can't get a lock?
- Keep trying ("spin" or "busy-wait"). Good if delays are short. This is our focus.
- Give up the processor: suspend yourself and ask the scheduler to run another thread on your processor. Good if delays are long, and always good on a uniprocessor.

Basic Spin-Lock: spin on the lock, enter the critical section, and reset the lock upon exit. As we will see, this lock suffers from contention.
Contention
Contention occurs when multiple threads try to acquire a lock at the same time. High contention: there are many such threads. Low contention: the opposite.

Welcome to the real world: Java's Lock interface, from the java.util.concurrent.locks package, is used like this:

    Lock mutex = new LockImpl(...);
    ...
    mutex.lock();
    try {
        ...   // critical section
    } finally {
        mutex.unlock();
    }

Why don't we just use the Filter or Bakery locks? The principal drawback is the need to read and write n distinct locations, where n is the maximum number of concurrent threads; these locks therefore require space linear in n.

What about the Peterson lock?

    class Peterson implements Lock {
        private boolean[] flag = new boolean[2];
        private int victim;
        public void lock() {
            int i = ThreadID.get();
            int j = 1 - i;
            flag[i] = true;                      // I'm interested
            victim = i;                          // you go first
            while (flag[j] && victim == i) {};   // wait
        }
    }

It is not our logic that fails, but our assumptions about the real world. We assumed that read and write operations are atomic, and our proof relied on the assumption that any two memory accesses by the same thread, even to different variables, take place in program order. Why doesn't execution take place in program order? Modern multiprocessors do not guarantee it:
- Compilers reorder instructions to enhance performance.
- The hardware reorders too: writes to shared memory do not necessarily take effect when they are issued, because writes are buffered and written to memory only when needed.

How can one fix this?
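One concrete fix on a real JVM is to give Peterson's shared fields volatile semantics, which forbids the reorderings described above. The sketch below is my own illustration, not course code: the class name, the demo method, and the explicit thread-id parameter (0 or 1, instead of a ThreadID helper) are assumptions. AtomicIntegerArray reads and writes have volatile semantics, which is what restores the program-order assumption the proof needs.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: Peterson's two-thread lock with volatile-semantics fields,
// so the JVM may not reorder the accesses the proof depends on.
// Names and the explicit thread-id parameter are assumptions.
class VolatilePeterson {
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 1 = interested
    private volatile int victim;

    void lock(int i) {                  // i is this thread's id: 0 or 1
        int j = 1 - i;
        flag.set(i, 1);                 // I'm interested
        victim = i;                     // defer to the other thread
        while (flag.get(j) == 1 && victim == i) {} // spin
    }

    void unlock(int i) {
        flag.set(i, 0);
    }

    // Demo: two threads increment a plain counter under the lock;
    // mutual exclusion means no increments are lost.
    static int demo(int perThread) {
        VolatilePeterson p = new VolatilePeterson();
        int[] count = {0};
        Thread t0 = new Thread(() -> {
            for (int k = 0; k < perThread; k++) { p.lock(0); count[0]++; p.unlock(0); }
        });
        Thread t1 = new Thread(() -> {
            for (int k = 0; k < perThread; k++) { p.lock(1); count[0]++; p.unlock(1); }
        });
        t0.start(); t1.start();
        try { t0.join(); t1.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return count[0];
    }
}
```

Because every increment is protected by the lock, demo(n) returns exactly 2n; with plain (non-volatile) fields, increments could be lost.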
Memory barriers (or memory fences) can be used to force outstanding operations to take effect. It is the programmer's responsibility to know when to insert a memory barrier. However, memory barriers are expensive.

Review: Test-and-Set
- Operates on a Boolean value.
- Test-and-set (TAS) swaps true with the current value; the return value tells whether the prior value was true or false.
- Can reset just by writing false.
- TAS is also known as getAndSet, from the java.util.concurrent.atomic package:

    public class AtomicBoolean {
        boolean value;
        public synchronized boolean getAndSet(boolean newValue) {
            boolean prior = value;
            value = newValue;
            return prior;
        }
    }

    AtomicBoolean lock = new AtomicBoolean(false);
    ...
    boolean prior = lock.getAndSet(true);   // swap old and new values

Test-and-Set Locks
- Lock is free: value is false. Lock is taken: value is true.
- Acquire the lock by calling TAS (swapping in true is called "test-and-set"): if the result is false, you win; if the result is true, you lose.
- Release the lock by writing false.

    class TASlock {
        AtomicBoolean state = new AtomicBoolean(false);   // lock state

        void lock() {
            while (state.getAndSet(true)) {}   // keep trying until lock acquired
        }
        void unlock() {
            state.set(false);                  // release by resetting state to false
        }
    }

Performance Experiment
- n threads increment a shared counter one million times.
- How long should it take? There is no speedup, because the critical section is a sequential bottleneck, but ideally adding threads should add no extra cost: a flat curve.
- How long does it take? Mystery #1: with the TAS lock, time grows sharply with the number of threads, far above the ideal curve.

Test-and-Test-and-Set Locks
- Lurking stage: wait until the lock "looks" free; spin while a read returns true (lock taken).
- Pouncing stage: as soon as the lock "looks" available (a read returns false), call TAS to acquire it; if TAS loses, go back to lurking.

What is going on?
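The counter experiment can be sketched as a small harness around the TAS lock above. This is a minimal illustration, not the course's actual benchmark; the class and method names are mine.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the performance experiment: n threads each increment a
// shared counter under a test-and-set lock. The spin loop is the
// TASlock from the slides; names here are my own.
class TasCounterDemo {
    static final AtomicBoolean state = new AtomicBoolean(false);
    static long counter = 0;

    static void lock()   { while (state.getAndSet(true)) {} } // spin until we swap false -> true
    static void unlock() { state.set(false); }                // release

    static long run(int nThreads, int perThread) {
        counter = 0;
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock();
                    counter++;       // the critical section
                    unlock();
                }
            });
        }
        long start = System.nanoTime();
        for (Thread t : ts) t.start();
        try { for (Thread t : ts) t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        System.out.printf("%d threads: %d ms%n", nThreads, (System.nanoTime() - start) / 1_000_000);
        return counter;
    }
}
```

To reproduce the slides' setup, hold the total increment count fixed (e.g. a million divided among the threads) and time run() for growing nThreads; on a multicore machine the elapsed time climbs with the thread count even though the total work does not.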
Test-and-test-and-set Lock

    class TTASlock {
        AtomicBoolean state = new AtomicBoolean(false);

        void lock() {
            while (true) {
                while (state.get()) {}        // wait until lock looks free
                if (!state.getAndSet(true))   // then try to acquire it
                    return;
            }
        }
    }

Mystery #2: in the counter experiment, the TTAS lock does much better than the TAS lock, but it is still far from the ideal flat curve.

Mystery: in our model, TAS and TTAS do the same thing. Questions: why is the TTAS lock so good (that is, so much better than TAS)? And why is it so bad (so much worse than ideal)?
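The TTAS loop above can be exercised the same way as the TAS experiment. A minimal self-contained sketch (class and method names are mine, not from the slides):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the TTAS lock from the slides plus a small demo. While the
// lock is held, waiters spin in state.get(), which is served from
// their local cache; only when the lock looks free do they issue the
// (bus-traffic-generating) getAndSet.
class TTASLockDemo {
    static final AtomicBoolean state = new AtomicBoolean(false);
    static long counter = 0;

    static void lock() {
        while (true) {
            while (state.get()) {}          // lurk: spin on (cached) reads
            if (!state.getAndSet(true))     // pounce: one TAS attempt
                return;
        }
    }

    static void unlock() { state.set(false); }

    static long run(int nThreads, int perThread) {
        counter = 0;
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock();
                    counter++;
                    unlock();
                }
            });
        }
        for (Thread t : ts) t.start();
        try { for (Thread t : ts) t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return counter;
    }
}
```

Timing this against the plain TAS version with the same thread counts is the experiment behind Mystery #2.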
In field tests, however, TTAS performs much better than TAS, though neither approaches the ideal.

Opinion: our memory abstraction is broken. The TAS and TTAS methods are provably the same (in our model), except that they aren't (in field tests). We need a more detailed model.

Bus-Based Architectures
- Processors and memory are connected by a shared bus: a broadcast medium, one broadcaster at a time, on which the processors and the memory all "snoop".
- Random-access memory is slow: tens of cycles per access.
- Each processor has a per-processor cache: small, fast (1 or 2 cycles), holding both addresses and state information.

Jargon Watch
- Cache hit: "I found what I wanted in my cache." Good Thing(tm).
- Cache miss: "I had to shlep all the way to memory for that data." Bad Thing(tm).

Cave canem: this model is still a simplification, but not in any essential way; it illustrates the basic principles. We will discuss the complexities later.

[Figure sequence: a processor issues a load request ("Gimme data") on the bus. If no cache holds the data, memory responds ("Got your data right here") and the requester caches a copy. If another processor already caches the data, that processor responds ("I got data") and the requester takes the copy off the bus. A processor may then modify its own cached copy of the data. But what's up with the other copies?]
needed Need the cache for something else Another processor wants it Some processor modifies its own copy What do we do with the others? How to avoid confusion? W rite-back coherence protocol: Invalidate other entries Requires non-trivial protocol … Write-Back Caches Cache entry has three states Invalid: contains raw seething bits (meaningless) Valid: I can read but I can’t write because it may be cached elsewhere Dirty: Data has been modified Intercept other load requests Write back to memory before using cache Invalidate data data Bus cache memory data Invalidate Mine, all mine! Invalidate Uh,oh data data Bus cache data cache data Bus cache memory data memory data 11 Invalidate Other caches lose read permission Invalidate Other caches lose read permission cache data Bus cache cache data Bus cache memory data This cache acquires write permission memory data Invalidate Another Processor Asks for Data Memory provides data only if not present in any cache, so no need to change it now (expensive) data cache cache Bus cache data Bus cache memory data memory data Owner Responds Here it is! End of the Day … cache data Bus cache data data Bus cache memory data memory data no writing Reading OK, 12 Back to TASLocks How does a TASLock perform on a write-back shared-bus architecture? Because it uses the bus, each getAndSet() call delays all the other threads Even those not waiting for the lock TASLock When the thread wants to release the lock it may be delayed because the bus is being monopolized by the spinners The getAndSet() call forces the other processors to discard their own cached copies – resulting in a cache miss every time They must then use the bus to fetch the updated copy What about the TTASLock? Suppose thread A acquires the lock. 
The first time thread B reads the lock it takes a cache miss and has to use the bus to fetch the new value As long as A holds the lock however, B repeatedly rereads the value – resulting in a cache hit every time B thus produces no extra traffic What about the TTASLock? However when A releases the lock: A writes false to the lock variable The spinner’s cached copies are invalidated Each one takes a cache miss They all use the bus to read a new value They all call getAndSet() to acquire the lock The first one to acquire the lock invalidates the others who must then reread the value Storm of traffic Local spinning Threads repeatedly reread cached values instead of repeatedly using the bus Exponential Backoff Recall that in the TTASLock, the thread first reads the lock and if it appears to be free it attempts to acquire the lock If I see that the lock is free, but then another thread acquires it before I can, then there must be high contention for that lock Better to back off and try again later 13 For how long should a thread back off? Rule of thumb: The larger number of unsuccessful tries, the higher the contention, the longer the thread should back off. What about lock-step? What happens if all the threads backs off the same amount of time? Instead the threads should back off for a random amount of time Each time the thread tries and fails to get the lock, it doubles the back-off time, up to a fixed maximum. 
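The doubling-with-a-cap rule can be sketched as a pair of pure functions. The constants and names below are illustrative assumptions, not values from the course.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the backoff schedule: after the k-th failed attempt the
// delay limit is MIN_DELAY * 2^k, capped at MAX_DELAY, and the actual
// wait is a random duration below that limit (to avoid lock-step).
// MIN_DELAY and MAX_DELAY are illustrative, not course values.
class BackoffSchedule {
    static final int MIN_DELAY = 1;     // milliseconds
    static final int MAX_DELAY = 1024;  // milliseconds

    // Upper bound on the delay after `failures` unsuccessful tries.
    static int limitAfter(int failures) {
        long limit = MIN_DELAY;
        for (int k = 0; k < failures; k++) {
            limit = Math.min(2 * limit, MAX_DELAY);   // double, up to the cap
        }
        return (int) limit;
    }

    // A randomized delay below the current limit.
    static int nextDelay(int failures) {
        return ThreadLocalRandom.current().nextInt(limitAfter(failures));
    }
}
```

Randomizing below the limit is what breaks the lock-step scenario: two threads that failed the same number of times still pick different delays with high probability.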
Approach: whenever a thread sees that the lock has become free, but fails to acquire it, it backs off before retrying.

Exponential Backoff Lock

    public class Backoff implements Lock {
        AtomicBoolean state = new AtomicBoolean(false);

        public void lock() {
            int delay = MIN_DELAY;                 // fix minimum delay
            while (true) {
                while (state.get()) {}             // wait until lock looks free
                if (!state.getAndSet(true))        // if we win, return
                    return;
                sleep(random() % delay);           // otherwise back off for a random duration
                if (delay < MAX_DELAY)
                    delay = 2 * delay;             // double the max delay, within reason
            }
        }
    }

Spin-Waiting Overhead: in the counter experiment, the Backoff lock beats the TTAS lock.

Backoff: Other Issues
- Good: easy to implement; beats the TTAS lock.
- Bad: the parameters must be chosen carefully. Performance is sensitive to the choice of minimum and maximum delays, and to the number of processors and their speed. There cannot be a general solution for all platforms and machines.

BackoffLock drawbacks:
- Cache-coherence traffic: all threads spin on the same location.
- Critical-section underutilization: threads delay longer than necessary.

Idea: avoid useless invalidations by keeping a queue of threads. Each thread notifies the next in line, without bothering the others.

Queue Locks
- Cache-coherence traffic is reduced, since each thread spins on a different location.
- No need to guess when to attempt to access the lock, which increases critical-section utilization.
- First-come-first-served fairness.

Anderson Queue Lock: an array of flags, initially {true, false, ..., false}, plus a tail counter. A thread acquires a slot with getAndIncrement() on the tail, then spins on the flag for its slot until it becomes true ("Mine!"). To release, it sets its own flag back to false and the next slot's flag to true ("Yow!").

    class ALock implements Lock {
        boolean[] flags = {true, false, ..., false};    // one flag per thread; n = flags.length
        AtomicInteger tail = new AtomicInteger(0);      // next slot to use
        ThreadLocal<Integer> mySlot;                    // thread-local variable

        public void lock() {
            mySlot = tail.getAndIncrement();            // take next slot
            while (!flags[mySlot % n]) {};              // spin until told to go
        }
        public void unlock() {
            flags[mySlot % n] = false;                  // prepare slot for re-use
            flags[(mySlot + 1) % n] = true;             // tell next thread to go
        }
    }

Thread-local variables are not stored in shared memory, do not require synchronization, and do not generate coherence traffic. Although the flags array is shared, contention on the array locations is minimised, since each thread spins on its own locally cached copy of a single array location.

Performance: shorter handover time than backoff; the curve is practically flat; scalable performance; FIFO fairness.

Anderson Queue Lock
- Good: the first truly scalable lock; simple and easy to implement.
- Bad: not space efficient. It needs one flag per thread, and the array must be sized in advance. What if the number of threads is unknown, or the number of actual contenders is small?

CLH Queue Lock: a virtual linked list keeps track of the queue. Each thread's status is saved in its node: true means it has acquired the lock or wants to acquire it; false means it is finished with the lock and has released it. Each node keeps track of its predecessor's status. Initially the lock is idle: the queue tail points to a node whose locked field is false (lock is free). When a thread ("Purple") wants the lock, it starts acquiring ...
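The preview cuts off just as the CLH acquisition sequence begins. The construction can be completed as follows; this is my own sketch of the standard CLH lock in the spirit of Herlihy & Shavit's presentation, not the slides' code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a CLH queue lock: threads form a virtual linked list.
// Each thread spins on its predecessor's node, so each thread spins
// on a different, locally cached location. Standard CLH construction;
// the slides' own version is truncated in this preview.
class CLHLock {
    static class QNode { volatile boolean locked; }

    private final AtomicReference<QNode> tail =
        new AtomicReference<>(new QNode());              // initially an unlocked dummy node
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public void lock() {
        QNode node = myNode.get();
        node.locked = true;                              // I want (or hold) the lock
        QNode pred = tail.getAndSet(node);               // splice onto the queue tail
        myPred.set(pred);
        while (pred.locked) {}                           // spin on predecessor's status
    }

    public void unlock() {
        QNode node = myNode.get();
        node.locked = false;                             // released: successor may go
        myNode.set(myPred.get());                        // recycle predecessor's node
    }

    // Demo: n threads increment a plain counter under the lock.
    static long demo(int nThreads, int perThread) {
        CLHLock lock = new CLHLock();
        long[] count = {0};
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock();
                    count[0]++;
                    lock.unlock();
                }
            });
        }
        for (Thread t : ts) t.start();
        try { for (Thread t : ts) t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return count[0];
    }
}
```

Unlike the Anderson lock, CLH needs no preallocated per-thread array: the queue grows and shrinks with the actual contenders, and each thread reuses its predecessor's node, so space is proportional to the number of threads currently using the lock.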