This section describes our research that uses and introduces a variety of dynamic, static, profile-driven and cooperative optimizations to improve performance. This section first discusses our new performance and memory analysis research tools and a few research results from these tools. We summaries our research on optimizations that improve garbage collection performance in relatively collector-neutral ways. We then describe ways to use the collector to improve memory system performance. We are also investigating JIT and ahead-of-time compiler optimizations. We describe our work evaluation of compiler optimizations in addition to specific compiler optimizations.
This section presents our work on a new tool that accurately simulates the Jikes RVM which is challenging because of dynamic code generation and other issues. We also discuss a methodology for using performance counters to analyze our programs.
In exploring the impact of compiler, run-time system, garbage collector, and architectural changes on computer performance, detailed simulations of computer architectures, at the level of clock-cycle timings, are indispensable. We needed a simulator on which we could simulate Jikes RVM, and it turned out that there was no cycle-level simulator available for its primary architecture (PowerPC) and that could support the requirements of Jikes RVM (such as dynamic code generation).
We designed, implemented, and validated extensions to an existing simulator, SimpleScalar, to create Dynamic SimpleScalar (DSS) [HBM+03]. DSS is a cycle-level simulator that simulates Java programs running on a Java Virtual Machine (JVM) with just-in-time compilation. It simulates a multi-way issue, out-of-order execution superscalar processor with a sophisticated memory system, and produces detailed cycle-level timing information. We add to the original SimpleScalar simulator support for signals, thread scheduling, synchronization, and dynamic code generation, all of which are required by our JVM. We validate our simulator running Jikes RVM. We executed a number of SPECjvm98 applications on a PowerPC architecture and show our simulation consistently reflects the execution characteristics of the actual computer. We also explored a few experimental results demonstrating the usefulness of our tool for finding a good heap size for a copying generational garbage collector/program pair.
At present we are making DSS more useful by improving it so that it can simulate Jikes RVM under the Linux operating system as well as under IBM's AIX operating system, and so that we can run simulations on a wider range of computing platforms, notablly Intel (x86) machines under Linux. Beyond that infrastructure work we look forward to adding our other memory system design work to DSS and to incorporating support for estimating energy consumption.
We are also performing performance analysis using timers and have encountered a number of methodological issues.
To obtain end-to-end measurements, we access a cycle-accurate timer immediately before the program starts and immediately after the program ends. Initially, we ran each program five times in the same VM and took the minimum run time in order to minimize perturbation due to other processes on the workstation (even though we ran the experiments in single-user mode, there may still be some perturbation due to essential system processes). Since we use the adaptive compiler, the first run is the longest because it performs all the compilation.1 We perform a whole-heap garbage collection before every run in order minimize memory management interference between the runs.
Using the above methodology we encountered an anomaly: an optimization that should not have been affecting performance in a particular configuration was yielding a 4% performance improvement for one benchmark on the IA32 workstation. Given the small variation across runs that we were observing, 4% was quite significant. To explore this anomaly we performed multiple back-to-back runs of the VM, with each run performing five runs of the benchmark (as above). We found that the second run of the VM produced faster run times (by up to 2.4%) than the first run of the VM. The third run, on the other hand, yielded either the same run times as the second run or slightly faster (by up to 0.25%). Subsequent runs produced nearly the same run times (usually within 0.5%) as the third run. We speculate that the reduction in execution time from one VM run to the next was caused by a corresponding drop in the number of page faults (which we measured using getrusage). Presumably, during the first run of the VM, many of the physical pages were in use by the other active processes and thus the first run incurred many page faults. By the third run, most of the physical pages had been reclaimed from the other active processes and thus the VM run incurred few page faults.
Based on the above experience, we modified our methodology to start the VM multiple times, each time running a benchmark five times and recording the execution times. At the end of each VM run we noted how the execution times and page fault counts compared to the previous run. If they appear to have stabilized, we stopped, otherwise we conducted more VM runs. None of our benchmarks needed more than 3 runs of the VM to stabilize.
This section discusses a number of relatively collector-neutral optimizations for garbage collectors.
Pretenuring can reduce copying and tracing costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring [CHL98] as follows [BSH+01]. (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can and do combine advice from different applications. We find that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to applications and the Jikes RVM. Our results demonstrate that building combined advice into Jikes RVM from different application executions improves performance regardless of the application Jikes RVM is compiling and executing. This build-time advice thus gives user applications some benefits of pretenuring without any application profiling. No previous work pretenures in the run-time system and compiler. (3) We find that application-only advice also consistently improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) Our same advice improves the performance of generational and older-first collection, illustrating that it is relatively collector neutral. We are currently investigating applying this advice to Beltway as well. In addition, we are investigating generating advice statically and dynamically to avoid the need to profile.
The effectiveness of garbage collectors and leak detectors in identifying dead objects depends on the accuracy of their reachability traversal. Accuracy has two orthogonal dimensions: (i) whether the reachability traversal can distinguish between pointers and non-pointers (type accuracy), and (ii) whether the reachability traversal can identify memory locations that will be dereferenced in the future (liveness accuracy). We presented an experimental study of the importance of type and liveness accuracy for reachability traversals [HHDH02]. We showed that liveness accuracy reduces the reachable heap size by up to 62% for our benchmark programs. However, the simpler liveness schemes (e.g., intraprocedural analysis of local variables) are largely ineffective for our benchmark runs: one must analyze global variables using interprocedural analysis to obtain significant benefits. Type accuracy has an insignificant impact on a garbage collector's ability to find unreachable objects in our benchmark runs. We report results for programs written in C, C++, and Eiffel.
In many garbage collected systems, the mutator performs a write barrier for every pointer update. Using generational garbage collectors, we study in depth three code placement options for remembered-set write barriers: inlined, out-of-line, and partially inlined (fast path inlined, slow path out-of-line). The fast path determines if the collector needs to remember the pointer update. The slow path records the pointer in a list when necessary. Efficient implementations minimize the instructions on the fast path, and record few pointers (from 0.16 to 3% of pointer stores in our benchmarks). We find the mutator performs best with a partially inlined barrier, by a modest 1.5% on average over full inlining.
We also study the compilation cost of write-barrier code placement. We find that partial inlining reduces the compilation cost by 20 to 25% compared to full inlining. In the context of just-in-time compilation, the application is exposed to compiler activity. Regardless of the level of compiler activity, partial inlining consistently gives a total running time performance advantage over full inlining on the SPECjvm98 benchmarks. When the compiler optimizes all application methods on demand and compiler load is highest, partial inlining improves total performance on average by 10.2%, and up to 18.5%.
We plan to extend this work and study a wider variety of barriers, such as cards, and use Beltway and Older-First, which make heavier use of write-barriers.
We are also exploring the effects of the large-object space which directly puts large objects into a mark-sweep free-list space to avoid copying them. We are investigating policies and cut-off sizes for this space. We also implemented a small optimization which puts the type-information-block (TIB) for each class in the immortal space. TIBs include the object size, pointer fields, and offsets. This optimization removed a source of unnecessary cross region pointers.
Our experimental work with GCJ (an ahead-of-time compiler for Java from GNU) and DSS using stock benchmarks has demonstrated potential for improvement in memory performance of Java garbage collection by up to 40%. We have extended DSS to model accurately the timing related to memory bandwidth. Currently we are measuring a modified BDW (Boehm-Deutsch-Weiser) [BDW+90] collector that is intended to schedule better the marking in the mark phase of collection. This work is still in progress. We also investigated improving collector performance using prefetching in the collector itself and found that prefetching shows promise for improving copying collection [Cah02].
Many researchers are exploring both hardware and software techniques to increase the cache hit ratio to achieve maximum performance. Loop and data transformations have been quite successful in the case of regular scientific programs where data accesses are fairly predictable (linear loop nests over arrays). However, for object-oriented programs where data accesses are less predictable, there has been little success so far.
We are also exploring run-time software techniques that improve cache performance of object-oriented programs [Sin02]. We propose a range of data layout schemes utilizing runtime profiling and feedback. We first consider the order in which a copying collector copies the objects: breadth-first (default), depth-first, and hierarchical. We use the Jikes RVM optimizing compiler and virtual machine with GCTk. We find the hierarchical scheme results in the best application times in a generational collector, but that the additional overheads sometimes hides the benefit. We design a mechanism in which the Java class-loader indicates to the garbage collector in what order to copy class objects, and thus they cooperate to produce better runtime data organizations. We use whole-program profile information to derive appropriate field orderings depending on which ones are accessed most frequently. We were not able to achieve consistent improvements, although we identify conditions when and to what extent our techniques are likely to be useful in real world object-oriented programs. We are currently investigating better tracing mechanisms that weigh longer lifetime objects more heavily and computes correlations between distinct object accesses. This profile information is yielding better results so far.
To determine which optimizations pay for themselves in a JIT setting, and to more fully understand the importance of individual optimizations and their interactions, we study several compiler optimizations in the Jikes RVM using the JIT compiler [LDM03]. To understand the interactions between optimizations, we evaluate optimizations both in isolation and in the presence of other optimizations. To increase the generality of our results we present results for a nine Java programs (including the SPECjvm98 benchmarks) and two architectures (IA32 and PowerPC). These results will be useful to compiler writers, particularly when compilation cost is a concern (e.g., on a battery-powered handheld or a JIT compiler that interleaves compilation with execution).
Besides presenting the results that are likely to be of interest to compiler writers, this paper also introduces a new methodology for measuring the synergy between optimizations and for evaluating the performance of optimizations.
Our results show that linear-scan register allocation and inlining are the first and second most important optimizations in the Jikes RVM. We also show that the performance (cost and benefit) of optimizations depends on the architecture (and more specifically size of the register set) and on what other optimizations are also being performed.
This section describes our investigation of a number of compiler optimizations for improving program performance.
Partial redundancy elimination (PRE) is a program transformation that optimizes by identifying expressions that are redundant on at least one (but not necessarily all) execution paths, inserting computations along edges on which such an expression is not redundant, and then eliminating the expression by reloading its value from a temporary. Chow et al. [Chow97, Kennedy99] devised an algorithm for performing partial redundancy elimination on intermediate representation in static single assignment (SSA) form which does not break SSA. Their algorithm, however, makes assumptions about the name space which may not be valid if the compiler performs other optimizations first. We developed an algorithm based on Chow's which makes no assumptions about the name space, or optimization order [VH:03]. The algorithm formulates PRE in an SSA setting and is as powerful as those of Bodik's path-based common-subexpression elimination [BGS98,BA98].
We are looking at alternatives approaches to implementing Java's synchronization primitives [WH03]. Instead of using lock-based mutual exclusion, which results in serial execution of synchronized sections (i.e., blocks or methods), we are considering implementing the synchronization primitives as transactional monitors, permitting concurrent execution of synchronized blocks/methods so long as those executions do not interfere. Transactional monitors can be considered to be streamlined version of an optimistic concurrency control facility, allowing multiple threads to execute a synchronized section so long as they do not violate serializability of the accesses that are performed. We are currently implementing and evaluating alternative implementations for transactional monitors. We are looking at alternative approaches to implementing Java's synchronization primitives that address the shortcomings of mutual exclusion locks for large-scale complex concurrent programming [WH03]. A transactional monitor allows multiple threads concurrent access to the data it protects. To ensure desired atomicity and serializability invariants, thus preserving Java's concurrency semantics, we track operations performed within the dynamic context of these monitors such that updates to shared data become visible to other threads only when they do not violate serializability invariants.
Transactional monitors can be considered to be a streamlined version of an optimistic concurrency facility. Optimistic concurrency protocols allow multiple threads to simultaneously access a synchronized section, but ensure that updates performed by a thread are made visible to other threads only if they do not introduce data races on shared data within the section. We present two implementations of a transactional monitors: a lightweight implementation that works well when contention for shared data is low and a more heavyweight implementation better suited to handle highly concurrent accesses. We show that transactional monitors have modest overheads, and in some cases may outperform implementations based on mutual exclusion locks, while at the same time elegantly handling such issues as priority inversion and deadlocks.
Modern hand held computers are certainly capable of running general purpose applications, such as Java virtual machines. However, short battery life rather than computational capability often limits the usefulness of hand held computers. We have considered how to reduce the energy consumption of Java applications [PLDM02].
Broadly speaking, there are three interleaved steps in running Java programs in a compiled environment: downloading the bytecodes, compiling and possibly optimizing the bytecodes, and running the compiled code. Optimized code typically runs faster than non-optimized code but the optimization process itself may consume significant energy. We considered the possibility of moving compilation (optimizing or non-optimizing) to a tethered server. We demonstrated that there is a significant benefit to moving compilation to a server (up to 67% reduction in energy for a realistic hand held configuration). We also demonstrated that there is no single best compilation strategy for all methods.
In the above study, we considered optimizations at a coarse granularity (none, O1, O2, O3). These levels are perhaps too coarse to be used in energy-limited environments. Thus, we are exploring the costs and benefits of individual optimizations along with the synergy between different optimizations. The overall goals of this work are as follows: (i) To identify the optimizations (or groups of optimizations) that yield the best cost-benefit tradeoff; and (ii) To better understand when optimizations are beneficial and when they are not.
Now that we have a good understanding of the costs and benefits of optimizations with respect to timing, we will explore the energy costs. How can we optimize each method so as to minimize its energy consumption? Our prior results indicate that there is significant benefit to picking the right strategy for each method. We will start by using machine learning methods to match methods with optimizations.
While caches are effective at avoiding most main-memory accesses, the few remaining memory references are still expensive. Even one cache miss per one hundred accesses can double a program's execution time. To better tolerate the data-cache miss latency, architects have proposed various speculation mechanisms, including load-value prediction. A load-value predictor guesses the result of a load so that the dependent instructions can immediately proceed without having to wait for the memory access to complete.
To use the prediction resources most effectively, speculation should be restricted to loads that are likely to miss in the cache and that are likely to be predicted correctly. Prior work has considered hardware- and profile-based methods to make these decisions. Our work focused on making these decisions at compile time. We show that a simple compiler classification is effective at separating the loads that should be speculated from the loads that should not. We presented results for a number of C and Java programs and demonstrated that our results are consistent across programming languages and across program inputs [BDH02].
We are continuing research in this area by investigating phase behavior in Java programs. Since different phases may have widely different behavior, customizing memory system optimizations (e.g., should we predict loads or not?) should yield significant performance benefits.
In this work, we introduce a compiler algorithm to add prefetches to tolerate latency for linked data structure and array traversals [Cah02,CM01,CM02]. We consider both because Java is also becoming a viable choice for numerical algorithms due to the software engineering benefits of object-oriented programming. Because these programs still use large arrays heavily, they continue to suffer from poor memory performance. To hide memory latency, we describe a new unified compile-time analysis for software prefetching arrays and pointers in Java. Our work uses data flow analysis to discover linked data structure accesses, and then we are able to generalize it to also identify loop induction variables used in array accesses. This work is the first to provide a formal basis for prefetch analysis. We then use this analysis to introduce greedy and jump pointer prefetches. A greedy prefetch is for a directly connected object, and a jump pointer is an additional field that points to a linked object that is not directly connected. We show this technique is effective for many of the Java Olden benchmarks that our research group wrote in an object-oriented style from the C versions.
Our algorithm also schedules prefetches for all array references that contain induction variables. We evaluate our technique on a set of array-based Java programs, and we report improvements greater than 15% in 6 of the 12 programs. Across all our programs, prefetching reduces execution time by an average of 23.5%, and the largest improvement is 57.5%. We added alias analysis, general pointer and index variable recurrence analysis, and software prefetching to Vortex [DDG+96], an ahead-of-time compiler for Java and other languages. We chose Vortex before we obtained the Jikes RVM.
Traditional software prefetching algorithms for C and Fortran use locality analysis and sophisticated loop transformations. Because our analysis is much simpler and quicker, it is suitable for including in a just-in-time compiler, although we use traditional compilation. We further show that the additional loop transformations and careful scheduling of prefetches used in previous work are not always necessary for modern architectures and Java programs. We are currently porting this work to the Jikes RVM.
The Jikes RVM and another VM-in-Java project, the Open Virtual Machine project (OVM) [GPV:01], both confront the difficult problem of expressing important non-Java behaviors [alpe:99, FHV:03]. Any such project needs mechanisms for this purpose, but different choices in the VM design affect the shape those mechanisms can take. For example, the JMTk memory management toolkit must be able to copy and modify pointer locations and this functionality cannot be expressed in Java. In exploring how a component such as JMTk that makes non-trivial use of such mechanisms and can be made to interoperate with OVM and the Jikes RVM, we design new OVM mechanisms and both influence and borrow from JMTk in the Jikes RVM. We design and implement a number of new idioms that use familiar Java syntax at the source level, and familiar, recognizable design patterns, for problems such as VM configurability that involve ad hoc techniques in prior work, and we transform the idioms to code without the indirection and inefficiency that would otherwise weigh against the familiar patterns.
Hardware trends have increased the disparity of processor and main memory performance. Processors are becoming faster and faster while the main memory performance has not kept up the pace. It takes many cycles to fetch data from the main memory. Data caches, which are much faster and much smaller in size bridge the performance gap to some extent by storing the most frequently accessed data items. For the best performance, it has become crucial to have almost all the data needed in the cache rather than having to access the much slower main memory.
In this section, we describe work on cooperative hardware/software optimizations. Our current implementations are for C and Fortran programs, but if we are able to obtain additional funding, we plan to explore these ideas in the Java setting in the future.
Most architectures use low set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide hints to the architecture [WMRW02]. We prove that these hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs written in Fortran, using the evict-me algorithm in both the level 1 and 2 cache improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, a combination of simple hardware prefetching and evict-me works together to further improve performance. We intend to investigate using evict-me for Java array and pointer based programs as well. This work is the first to suggest and show how to improve cache replacement decision.
In this work, we combine compiler hints with aggressive hardware prefetching [WBM+03]. We used compiler analysis to generate hints for filtering the level-2 cache prefetches generated by the Scheduled Region Prefetching (SRP) mechanism [LRB01]. SRP aggressively prefetches data into the L2 cache, and improves performance by 22%. But it essentially uses all of the available bandwidth on a Rambus channel, increasing bus utilization by 180% over a system with no prefetching. Our compiler analysis is able to reduce this utilization to just 23% more than a system with no prefetching, without sacrificing performance. This both enables shared-memory multiprocessing in the presence of aggressive prefetching, and reduces power consumption.
Capitulating loads (c-loads) refine the notion of informing loads [HMMS98], that combine the functionality of prefetch instructions with an informing load. On cache hit, a capitulating load operates like a normal load instruction, returning the target value in the result register, while also returning a non-zero success code in a separate status register. On cache miss, a capitulating load functions like a prefetch instruction, issuing a request to the memory subsystem to load the target value, while also returning a failure code (zero) in the status register. Programs can thus use capitulating loads to reschedule their execution dynamically in response to the availability of data in the cache, by executing conditional code dependent on the value of the status register for a given c-load.
We consider how c-loads can be applied to the domain of garbage collection, showing that they can yield performance improvements that are significantly better than can be obtained by existing prefetch-based techniques. The results for small caches (as may be found on embedded processors) are particularly pronounced, showing better stability for performance as compared to prefetching.
We explore a high performance cache structure with a hardware prefetching mechanism that enhances exploitation of spatial and temporal locality [LKW03]. The proposed cache, which we call a Selective-Mode Intelligent (SMI) cache, consists of three parts: a direct-mapped cache with a small block size, a fully associative spatial buffer with a large block size, and a hardware prefetching unit. Temporal locality is exploited by selectively moving small blocks into the direct-mapped cache after monitoring their activity in the spatial buffer for a time period. Spatial locality is enhanced by intelligently prefetching a neighboring block when a spatial buffer hit occurs. The overhead of this prefetching operation is shown to be negligible. We also show that the prefetch operation is highly accurate: Over 90 percent of all prefetches generated are for blocks that are subsequently accessed. Our results show that the system enables the cache size to be reduced by a factor of four to eight relative to a conventional direct-mapped cache while maintaining similar performance. Also, the SMI cache can reduce the miss ratio by around 20 percent and the average memory access time by 10 percent, compared with a victim-buffer cache configuration.
We are now exploring the use of the SMI organization as a small level-0 cache that would reside between the L1 and the processor. The need for such a cache is driven by concerns that the size of L1 caches is becoming limited by wire delays. A SMI cache would enable the use of a very small, fast, single-cycle cache at the top of the hierarchy, and relax the timing constraint on the conventional L1 cache. Another area that we are exploring for the SMI cache is to use it with applications that have large objects. If the application is accessing a small subset of the fields in such objects, then SMI will selectively promote these fields to the temporal cache, and improve utilization. If the application is using a large fraction of the fields, then the adaptive prefetching should exploit the greater spatial locality. This effort is currently in the stage of extending our Java simulator (DSS) to include SMI, and we will begin to experiment with these two scenarios.
Acknowledgments and Disclaimer
This work is supported by NSF ITR CCR-0085792 (this grant). In addition, this work is synergistic with the following grants and support from NSF ACI-9982028, NSF EIA-9726401, NSF CDA-9502639, NSF Career CCR-0133457, NSF IIS-9988637, an IBM faculty partnership award, DARPA F30602-98-1-0101, DARPA F33615-01-C-1892, DARPA F33615-01-C-1894, IBM, Microsoft, and Compaq. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
For questions or comments contact hoffmann@cs.umass.edu.Copyright 2001-2003 by the DaCapo
Project,
All Rights Reserved.