Compiler Optimizations for Shared-Memory Parallel Architectures

Exploiting Data Locality and Parallelism. We have developed new compiler optimizations for parallel scientific Fortran programs. The algorithms tailor complete applications to shared-memory multiprocessors with local caches and a bus, exploiting parallelism, data locality, and the interaction between them. They are based on a cost model that accurately predicts cache-line reuse from multiple accesses to the same memory location and from accesses to consecutive locations on a cache line. Unlike approaches that generate all possible loop organizations and then try to select the best, our algorithm derives the best loop organization directly and uses transformations to achieve it [MCT:96]. Using the result, the optimizer discovers and introduces complementary parallelism [KM:92, McKinley:98].

Using Fusion and Distribution. The optimizer further improves the profitability of parallelism by seeking to increase its granularity and remove barrier synchronization. We use fusion and distribution to improve both locality and the granularity of parallelism [KM:93, SM:97].

Interprocedural Parallelization. To make this strategy applicable to complete applications rather than simple loop nests, we developed new techniques that optimize effectively across procedure boundaries via loop embedding and loop extraction [HKM:91], and in loops containing arbitrary conditional control flow [KM:90]. The resulting optimizer is efficient, and several of its component algorithms are optimal.

Evaluation. We believe that good performance will rely on a combination of user parallelization and compiler optimization. To test our algorithms, we applied them to sequential versions of hand-coded parallel applications. On a 20-processor Sequent, we demonstrate their efficacy on 10 programs written for a variety of parallel machines (Sequent, Alliant, Intel Hypercube). For all but one program, the optimizer matched or improved the performance of the hand-crafted parallel version (three programs improved by an average of 23%) [McKinley:98, McKinley:94].

The compiler's improvements come from balancing locality and parallelism and from increasing the granularity of parallelism. The compiler also improves on user-parallelized codes because it is inherently more methodical than a programmer. Most importantly, these results suggest that both parallel algorithms and compiler optimizations are needed to utilize parallel machines effectively.

This work is supported by grants from Digital Equipment, NSF grant EIA-9726401, and an NSF Infrastructure grant CDA-9502639. Kathryn S. McKinley is supported by an NSF CAREER Award CCR-9624209. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
(Last changed: May 26, 1998.)