• Increased cache capacity
• higher associativity
• hardware prefetching of instructions and data
• equidistant locality
• second-level / third-level cache (L2, L3)
• L3 often shared by multiple cores
• out-of-order instruction execution
• branch prediction

All this makes modern CPUs highly complex.

Improving cache performance: software

• Merging Arrays: Improve spatial locality by using a single array of structs instead of parallel arrays (Fortran).
• Loop Interchange: Change the nesting of loops so that data is accessed in the order it is stored in memory.
• Loop Fusion: Combine two or more independent loops that have the same loop bounds and share some variables.
• Blocking or "tiling": Improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows. (prime sieve)

Matrix Multiply

Data or loop reordering to improve cache
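
The loop interchange and blocking items above can be illustrated on matrix multiply. The following is a minimal sketch, not taken from the slides: it assumes square row-major matrices stored in flat arrays, and the function names (matmul_ijk, matmul_ikj, matmul_blocked) and the block-size parameter bs are hypothetical names chosen for illustration. All three variants compute the same product; only the order of memory accesses differs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: n x n matrices in row-major order, element (i, j) at m[i*n + j]. */

/* Baseline i-j-k order: the inner loop reads b with stride n (down a column). */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* Loop interchange (i-k-j): the inner loop now reads b and updates c
 * with stride 1, i.e. in the order the data is stored in memory. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i*n + k];
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Blocking ("tiling"): work on bs x bs sub-blocks so that each block of
 * a, b and c is reused while it is still likely to be in cache.
 * bs is a tuning parameter; a suitable value depends on the cache size. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_sz(ii + bs, n);
                size_t k_end = min_sz(kk + bs, n);
                size_t j_end = min_sz(jj + bs, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i*n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
            }
}
```

Whether the interchanged or blocked version is actually faster depends on matrix size, cache sizes and the compiler, so it should be measured rather than assumed.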
