Facilitating Efficient Synchronization of Asymmetric Threads on Hyper-Threaded Processors *

Nikos Anastopoulos and Nectarios Koziris
National Technical University of Athens
School of Electrical and Computer Engineering
Computing Systems Laboratory
{anastop,nkoziris}@cslab.ece.ntua.gr

Abstract

So far, the privileged instructions MONITOR and MWAIT, introduced with the Intel Prescott core, have been used mostly for inter-thread synchronization in operating system code. In a hyper-threaded processor, these instructions offer a "performance-optimized" way for threads involved in synchronization events to wait on a condition. In this work, we explore the potential of using these instructions for synchronizing application threads that execute on hyper-threaded processors and are characterized by workload asymmetry. Initially, we propose a framework through which one can use MONITOR/MWAIT to build condition wait and notification primitives, with minimal kernel involvement. Then, we evaluate the efficiency of these primitives in a bottom-up manner: at first, we quantify certain performance aspects of the primitives that reflect the execution model under consideration, such as resource consumption and responsiveness, and we compare them against other commonly used implementations. As a further step, we use our primitives to build synchronization barriers. Again, we examine the same performance issues as before, and using a pseudo-benchmark we evaluate the efficiency of our implementation for fine-grained inter-thread synchronization. In terms of throughput, our barriers yielded 12% better performance on average compared to Pthreads, and 26% compared to a spin-loops-based implementation, for varying levels of thread asymmetry. Finally, we test our barriers in a real-world scenario, specifically by applying thread-level Speculative Precomputation to four applications.
For this multithreaded execution scheme, our implementation provided up to 7% better performance compared to Pthreads, and up to 40% compared to spin-loops-based barriers.

* This research is supported by the PENED 2003 Project (EPAN), co-funded by the European Social Fund (75%) and National Resources (25%).

1 Introduction

Simultaneous Multithreading (SMT) [8] allows a superscalar processor to issue instructions from multiple independent threads to its functional units, in a single cycle. The motivation behind this technique is to maximize the utilization of processor resources by exploiting the thread-level parallelism that can be extracted from an application. Hyper-threading technology [6] is Intel's two-threaded, low-end approach to SMT. In a hyper-threaded processor, almost all resources are shared, and only the architectural state, along with any control-flow related structures, are replicated for each thread.
