tures, which provides higher prefetch lookahead. Their technique reduces the storage and energy overhead of accurate HW instruction prefetching (incurred in the scheme of Ferdman et al.), while still maintaining high prefetcher coverage and program performance.

4.5. Use of Branch Prediction or History Information

We now discuss some techniques that use information from branch prediction to speculate on program control flow for determining the data to be prefetched.

Srinivasan et al. present an instruction prefetching scheme that correlates branch instruction execution with I-cache misses, based on the fact that control-flow alterations due to branches cause I-cache misses. Based on this, branch instructions are used to trigger prefetching of the instructions that appear in the execution after a fixed number (say, K, for example, K = 4) of branches. For instance, a candidate basic block (BB1) will be associated with a branch instruction (R1) if an I-cache miss to BB1 happens exactly K branches after R1 executes. On a future execution of R1, BB1 will be prefetched. Thus, their technique avoids the need for a branch predictor to estimate the result of K+1 branch operations. Also, since another basic block starts at the branch instruction's target address and may span several cache lines, their technique also stores the length of each prefetch candidate block along with its address, so that entire blocks can be prefetched in a timely manner (a sketch of this correlation mechanism appears at the end of this subsection).

Zilles and Sohi note that a few frequently executed static instructions (called problem instructions) cause a majority of branch mispredictions and cache misses. Their technique creates a code portion, called a speculative slice, that mimics the computation involving the problem instruction and includes only those operations that are necessary to compute the outcome of the problem instruction. By forking such slices well before the problem instruction, data prefetching can be done to avoid the penalty of misses.
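As an illustration of the correlation mechanism of Srinivasan et al., the following Python sketch models the training and triggering steps under some simplifying assumptions: the correlation table is unbounded and purely software-managed (a real hardware design would use a small, set-associative table), and the prefetch callback stands in for the actual I-cache fill logic.

    from collections import deque, defaultdict

    class BranchCorrelatedPrefetcher:
        def __init__(self, k=4):
            self.k = k
            self.recent_branches = deque(maxlen=k)  # last k executed branch PCs
            self.table = defaultdict(list)          # branch PC -> [(block_addr, length)]

        def on_branch(self, branch_pc, prefetch):
            # Trigger: prefetch every block previously correlated with this
            # branch, using the stored length to fetch the entire block.
            for block_addr, length in self.table.get(branch_pc, []):
                prefetch(block_addr, length)
            self.recent_branches.append(branch_pc)

        def on_icache_miss(self, block_addr, length):
            # Train: correlate the missing block with the branch that executed
            # exactly k branches before this miss.
            if len(self.recent_branches) == self.k:
                trigger = self.recent_branches[0]   # oldest of the last k branches
                if (block_addr, length) not in self.table[trigger]:
                    self.table[trigger].append((block_addr, length))

Note that no branch outcome is ever predicted here: the prefetch fires on the mere execution of the trigger branch, which is what lets the scheme dispense with predicting K+1 branch outcomes.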
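The core of slice construction in Zilles and Sohi's approach can likewise be illustrated by a backward dependence walk. The sketch below is a deliberately simplified, register-only model (real speculative slices also track memory dependences and control flow, and are forked as helper threads); the trace format is an assumption of this example.

    def backward_slice(trace, problem_idx):
        # trace: list of (dest_reg, src_regs, text) tuples in program order.
        # Returns indices of the instructions needed to compute the inputs of
        # the problem instruction -- the operations a speculative slice keeps.
        needed = set(trace[problem_idx][1])       # registers the problem instr. reads
        slice_idxs = [problem_idx]
        for i in range(problem_idx - 1, -1, -1):  # walk backwards over producers
            dest, srcs, _ = trace[i]
            if dest in needed:
                slice_idxs.append(i)
                needed.discard(dest)              # producer found for this register
                needed.update(srcs)               # now need the producer's inputs
        return sorted(slice_idxs)

    # Example: only the instructions feeding the problem load's address survive.
    trace = [
        ("r1", [], "li  r1, 0x100"),
        ("r2", [], "li  r2, 8"),
        ("r3", ["r1", "r2"], "add r3, r1, r2"),
        ("r9", [], "li  r9, 7"),          # unrelated work, excluded from the slice
        ("r4", ["r3"], "ld  r4, 0(r3)"),  # problem load
    ]
    print(backward_slice(trace, 4))       # -> [0, 1, 2, 4]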
4.6. Memory Side Prefetching

Solihin et al. present a technique where CoR prefetching is performed by a user thread running in main memory, and the prefetched data are sent to the L2 cache. L2 cache misses are tracked and recorded in a CoR table. Afterwards, for each miss, the CoR table is looked up and a prefetch of several lines is triggered for the L2 cache. The CoR table is stored in main memory and, thus, changes to the L2 cache are minimal. They show that by combining their technique with a core-side sequential prefetcher, the performance improvement can be increased further. Also, the prefetch algorithm used by the thread can be adapted on a per-application basis.

Yedlapalli et al. present a memory-side prefetcher (MSP) that fetches data on-chip from memory but, unlike the technique of Solihin et al., does not push the data into the caches and, thus, avoids resource contention. They use a next-line prefetching scheme and prefetch when a row-buffer hit occurs, such that the demand request is served first and the prefetch request is served afterwards. Successive requests to lines in that row then turn into prefetch hits. Data are maintained in a separate buffer at each memory
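The following Python sketch captures the row-buffer-aware, next-line flavor of memory-side prefetching described above. The buffer organization, row geometry, and the read_line callback are illustrative assumptions of this sketch rather than details of Yedlapalli et al.'s design; the key points it models are that the demand is served before the prefetch and that prefetched lines stay in a memory-side buffer instead of being pushed into the caches.

    class MemorySidePrefetcher:
        LINES_PER_ROW = 64                # assumed DRAM row geometry

        def __init__(self, buffer_size=32):
            self.open_row = None          # currently open DRAM row
            self.buffer = {}              # line addr -> data (memory-side buffer)
            self.buffer_size = buffer_size

        def row_of(self, line_addr):
            return line_addr // self.LINES_PER_ROW

        def access(self, line_addr, read_line):
            # A demand that hits the memory-side buffer is a prefetch hit and
            # is served without a new DRAM access.
            if line_addr in self.buffer:
                return self.buffer.pop(line_addr)
            row_hit = (self.row_of(line_addr) == self.open_row)
            self.open_row = self.row_of(line_addr)
            data = read_line(line_addr)   # serve the demand request first
            if row_hit:
                # Row-buffer hit: cheaply read the next line of the open row
                # into the buffer after the demand has been served.
                nxt = line_addr + 1
                if self.row_of(nxt) == self.open_row and len(self.buffer) < self.buffer_size:
                    self.buffer[nxt] = read_line(nxt)
            return data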