Using Fusion and Distribution. The optimizer further improves the profitability of parallelism by increasing its granularity and removing barrier synchronization. We use loop fusion and loop distribution to improve both data locality and the granularity of parallelism [KM:93, SM:97].
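As a rough illustration (not code from the papers cited above), loop fusion can be sketched as follows: two parallel loops over the same data each incur a barrier and evict the intermediate array between them, while the fused loop has coarser granularity, one barrier, and immediate reuse. The function names and array sizes here are hypothetical.

```c
#define N 1000

/* Before fusion: two parallel loops, each followed by an implicit barrier
 * on a parallel machine; b[i] is written in the first loop and has likely
 * left the cache by the time the second loop reads it. */
void unfused(double *a, double *b, double *c) {
    for (int i = 0; i < N; i++)
        b[i] = a[i] * 2.0;
    /* barrier between the two parallel loops */
    for (int i = 0; i < N; i++)
        c[i] = b[i] + 1.0;
}

/* After fusion: one parallel loop, one barrier, coarser-grained iterations,
 * and b[i] is reused while still in a register or cache. */
void fused(double *a, double *b, double *c) {
    for (int i = 0; i < N; i++) {
        b[i] = a[i] * 2.0;
        c[i] = b[i] + 1.0;
    }
}
```

Distribution is the inverse transformation: splitting one loop into several, for instance to separate a parallelizable statement from one involved in a dependence cycle.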
Interprocedural Parallelization. To make this strategy applicable to complete applications, rather than simple loop nests, we developed new techniques that effectively optimize across procedure boundaries, via loop embedding and loop extraction [HKM:91], and within loops containing arbitrary conditional control flow [KM:90]. The resulting optimizer is efficient, and several of its component algorithms are optimal.
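As a minimal sketch (the function names and bodies are hypothetical, not taken from [HKM:91]), loop embedding moves a loop header from a call site into the callee, so the whole iteration space becomes visible to analysis within a single procedure; loop extraction is the inverse, hoisting a loop out of a callee into its caller.

```c
#define N 1000

/* Helper called once per element in the original code. */
void update(double *x, double s) {
    *x = *x * s;
}

/* Before embedding: the loop sits in the caller, and the call inside it
 * hides the loop body from analysis of this procedure. */
void caller_before(double *x) {
    for (int i = 0; i < N; i++)
        update(&x[i], 2.0);   /* one element per call: fine-grained, opaque */
}

/* After embedding: the loop has moved into the callee, exposing the whole
 * iteration space to the optimizer as one analyzable, parallelizable unit. */
void update_all(double *x, double s) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] * s;
}

void caller_after(double *x) {
    update_all(x, 2.0);
}
```

Either transformation lets the loop-level optimizer treat a loop containing (or contained in) a call as if it were an ordinary loop nest in one procedure.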
Evaluation. We believe that good performance relies on a combination of user parallelization and compiler optimization. To test our algorithm, we applied it to sequential versions of hand-coded parallel applications. Our experimental results on a 20-processor Sequent demonstrate its efficacy on 10 programs written for a variety of parallel machines (Sequent, Alliant, Intel Hypercube). In all but one program, the optimizer matched or improved the performance of the hand-crafted parallel programs (3 programs improved by an average of 23\%) [McKinley:98, McKinley:94].
The compiler's improvements come from balancing locality and parallelism and from increasing the granularity of parallelism. The compiler also improves on user-parallelized codes because it is inherently more methodical than a user. Most importantly, these results suggest that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.