Because processor speeds are improving much faster than memory speeds, one of the greatest obstacles to achieving peak processor performance today is getting data into the first-level cache before the processor needs it. Our work focuses on three areas:
Understand Program Locality Properties. We are analyzing and quantifying the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in loop nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions about their locality characteristics have emerged. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality in numerical codes at the level of references, loop nests, and whole programs. Our results show that several popular assertions are at best overstatements. For example, although most reuse occurs within a loop nest, in line with popular assertions, most misses are inter-nest capacity misses and correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse play balanced roles within a loop nest, and that most reuse across nests and across the entire program is temporal. These results are consistent with high hit rates (80% or more), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect [MT:96].
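To make the terminology concrete, the following sketch (our own illustration in C, not code from the benchmark suites) shows the kinds of reuse at issue; the array size and cache assumptions are hypothetical.

    /* Illustrative sketch only: assumes a cache line holds several
       8-byte doubles and that N doubles exceed the cache capacity. */
    #define N (1 << 20)
    double a[N], b[N], sum;

    void two_nests(void)
    {
        /* Nest 1: sum has temporal reuse (the same location is read
           and written every iteration); b[i] has spatial reuse
           (consecutive elements share a cache line). */
        for (int i = 0; i < N; i++)
            sum += b[i];

        /* Nest 2: re-reads every b[i]. Because b no longer fits in
           the cache by the time this nest runs, these accesses are
           inter-nest capacity misses: potential reuse between nearby
           loop nests that the cache fails to capture. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }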
Compiler Optimizations to Improve Locality. We developed compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that, for kernels, our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs [CMT:94, MCT:96].
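As one concrete example of the compound transformations (our own illustration, not a program from the experiments), loop permutation reorders a nest so that the innermost loop carries the spatial reuse; a cost model of this kind would compare the cache lines touched per iteration under each loop order and choose the second form.

    #define N 512
    double a[N][N];

    /* Before: the inner loop strides down a column of the row-major
       array, so successive accesses fall on different cache lines. */
    void unoptimized(void)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = a[i][j] + 1.0;
    }

    /* After loop permutation: the inner loop now walks along a row,
       so successive accesses share a cache line (spatial reuse). */
    void permuted(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = a[i][j] + 1.0;
    }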
When tiling is profitable, the model also provides a key insight for selecting which loop(s) to tile. We developed an algorithm that determines problem-size-dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. This algorithm selects tile sizes that prevent self-interference and achieves essentially the best performance attainable with loop transformations for tiled kernels [CM:95]. We verified this result through simulation and run-time experiments.
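For illustration, a minimal tiled matrix multiply in C (our sketch; the tile size T is a placeholder, whereas the algorithm in [CM:95] derives it from the cache size, line size, and problem size so that the tile does not interfere with itself in a direct-mapped cache):

    #define N 1024
    #define T 64    /* placeholder; [CM:95] computes a problem-size-
                       dependent value that avoids self-interference */
    double a[N][N], b[N][N], c[N][N];

    /* Tiled matrix multiply, assuming c is zero-initialized and N is
       a multiple of T. Tiling the k and j loops keeps a T x T block
       of b resident in the cache while it is reused across all
       iterations of i. */
    void matmul_tiled(void)
    {
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = 0; i < N; i++)
                    for (int k = kk; k < kk + T; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + T; j++)
                            c[i][j] += aik * b[k][j];
                    }
    }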
Scheduling to Hide Latencies. When misses are unavoidable, the most popular approach to date uses prefetching to hide latencies. Although prefetching is often effective, it may replace data in the cache that is still in use, it may fetch data already in the cache (useless prefetches), and it may fetch data that is never used. We are therefore exploring a complementary approach that uses scheduling to hide latencies with instruction-level parallelism. We are developing new scheduling algorithms that treat variable latencies specially, and we are exploring the use of static prediction, profiling, and performance counters to drive the scheduler.
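To contrast the two approaches on a toy reduction, the sketch below is our own illustration of the textbook techniques, not our scheduling algorithms: the prefetch distance and unroll factor are arbitrary, and __builtin_prefetch is the GCC/Clang intrinsic.

    /* Textbook software prefetching: fetch a line ahead of its use.
       A poor distance reproduces the problems noted above (evicting
       live data, or fetching lines already present or never used).
       Prefetching past the end of x is harmless. */
    double sum_prefetch(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&x[i + 16], 0, 1);  /* read, low reuse */
            s += x[i];
        }
        return s;
    }

    /* Scheduling alternative: unrolling exposes four independent
       accumulators, so instruction-level parallelism overlaps the
       latency of one load with useful work on the others. */
    double sum_scheduled(const double *x, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)   /* remainder */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }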