This section summarizes our findings on the following research topics: (1) analysis tools and results on object demographics and tracing, (2) memory management toolkit design and evaluation, and (3) new garbage collection algorithms. We show how our object demographic analysis leads to the design of our new algorithms and optimizations, and how our toolkit design eases the implementation of new collectors, and levels the playing field to enable fair and meaningful comparisons across algorithms and mechanisms.
This section describes our new mechanisms for analyzing and producing object lifetime traces.
To explore the design and evaluation of GC (garbage collection) algorithms quickly, researchers are using simulation based on traces of object allocation and lifetime behavior. The brute force method generates perfect traces using a whole-heap GC at every potential GC point in the program. Because this process is prohibitively expensive, researchers often use granulated traces by collecting only periodically, e.g., every 32K bytes of allocation.
We extend the state of the art for simulating GC algorithms in two ways [HBM+02]. First, we develop a systematic methodology and results on the effects of trace granularity for a variety of copying GC algorithms. We show that trace granularity often distorts GC performance results compared with perfect traces, and that some algorithms are more sensitive to this effect than others. Second, we introduce and measure the performance of a new precise algorithm for generating GC traces which is over 800 times faster than the brute force method. Our algorithm, called Merlin, frequently timestamps objects and later uses the timestamps of dead objects to reconstruct precisely when they died. It performs only periodic garbage collections and achieves high accuracy at low cost, eliminating any reason to use granulated traces.
In this work we have also considered the more theoretical side [HIM02]. While the design of garbage collection algorithms has come of age, the analysis of these algorithms is still in its infancy. Current analyses are limited to merely documenting costs of individual collector executions; conclusive results, measuring across entire programs, require a theoretical foundation from which proofs can be offered. A theoretical foundation also allows abstract examination of garbage collection, enabling new designs without worrying about implementation details. We propose a theoretical framework for analyzing garbage collection algorithms and show how our framework could compute the efficiency (time cost) of garbage collectors. The central novelty of our proposed framework is its capacity to analyze costs of garbage collection over an entire program execution.
The theoretical framework inspired the new Merlin trace generation algorithm. The central theoretical result, achieved using the framework, is a proof that Merlin's asymptotic running time is optimal for trace generation.
We investigated using analytical models for object lifetimes, and used Merlin to measure object lifetimes. We then use these measurements to design collection algorithms and optimizations.
Analytical models of memory object lifetimes are appealing because they would enable mathematical analysis and fast simulation of memory management behavior of programs [SMM00]. We investigate models for object lifetimes drawn from programs in object-oriented languages such as Java and Smalltalk. We explore postulated analytical models and compare them with observed lifetimes for 58 programs. We find that observed lifetime distributions do not match previously proposed object lifetime models, but loosely agree in salient shape characteristics with the gamma distribution family used in statistical survival analysis for general populations.
By using the Merlin fast object-event tracing mechanism, we were able to construct fully accurate object lifetime traces of a number of large standard benchmark programs for Java. For example, Figure 1 plots the bytes of objects live in the heap (y-axis) versus total allocation, where each line represents time of allocation in a perfect age-ordered heap. Thus, it only includes reachable objects. The distance between the top and second line represents immortal objects allocated at the beginning of the program. When a line drops or drops out, it shows object deaths. We then analyzed the traces to see if lifetimes of objects could be predicted from information that can be available at the object allocation site at the time of allocation [ISF03]. We found that in a surprising number of programs (though not all) very accurate prediction is possible, using only a modest amount of information (a short prefix of the method call stack). Note that since the traces we base our analysis on are fully accurate, we were for the first time ever able to analyze a prediction model in which the prediction is of the finest granularity possible - the least unit of object allocation.
We created a new garbage collection framework, GCTk, (2001-2002) in which we implemented most of the collectors and optimizations we describe below. When we began this project, the Jikes RVM included a number of well performing classic collectors, but their monolithic and independent implementations made them poorly suited for further development of new collectors with different needs and fair comparisons. Our previous toolkit work [DMH92] also was not general enough to build our new collectors. We thus designed and implemented GCTk, a general Garbage Collection Toolkit for Java [DMH92,BJMM02,BSH+01,SHB+02]. This framework provided a single shared implementation of key functions such as scanning and remembered sets. In addition, GCTk implemented copying age-based collectors by separating the collection increment from the heap position, but it did not include free-list managers, nor could it mix and match policies to make hybrid collectors.
Based on that experience, we designed a next-generation garbage collection framework, JMTk (2002-present), the Java Memory Management Toolkit [BCM03]. JMTk improves upon GCTk in a number of ways. It creates a cleaner, and we believe portable language/compiler interface. JMTk uses a composable design such that we can easily mix and match policies and mechanisms. It also adds free-list memory managers (e.g., mark-sweep and reference counting), a large object space, the composition of copying and free-list spaces, and dynamic flexible boundaries between different allocation spaces.
The goals of JMTk are to provide an efficient, composable, extensible, and portable toolkit for quickly building and evaluating new and existing garbage collection algorithms. Our design clearly demarcates the external interface between the collector and the compiler for portability. For extensibility, JMTk provides a selection of allocators, garbage identification, collection, pointer tracking, and other mechanisms that are efficient and that a wide variety garbage collection algorithms can compose and share. For instance, our mark-sweep and reference counting collectors share the free list implementation. We recently used JMTk perform a comprehensive and detailed study of collectors including copying, mark-sweep, reference counting, copying generational, and hybrid generational collectors using JMTk in the Jikes RVM on a uniprocessor. We find that the performance of collectors in JMTk is comparable to the highly tuned original Jikes collectors. In a study of full heap and generational collectors, we confirm the significant benefits of generational collectors on a wide variety of heap sizes, and reveal that on very small heaps, collection time is enormous. These experiments add other new insights, such as, that for a variety of generational collectors, a variable-size nursery which is allowed to grow to fill all the space not in use by the older generation performs better than a fixed-size nursery. We thus establish the utility of extensive collector comparisons and the benefits of our design. This work is submitted for publication.
In early 2003, the JMTk collectors became the default ones in the Jikes RVM distributions. We know of researchers at Iowa State, Northeastern, UCSB, and McGill that have used our toolkits for their research.
This section highlights our work on the following garbage collection algorithms:
We build on previous work that shows the following are true for many programs. (1) The weak generational hypothesis: young objects die quickly and thus should be the target frequent collection [LH83,Ung84]. (2) The corollary: the few objects that do not die young are likely to be very long lived and so should be avoided as collection targets. (3) Copying tends to improve locality by decreasing the memory footprint. These observations underlie and explain the success of generational and generational copying collectors. Until recently, the best performing copying garbage collectors used a generational policy, which repeatedly collects the very youngest objects, copies any survivors to an older space, and infrequently collects the older space.
Inspired in part by our object lifetime analysis, we add to these previous ideas, and combine them with the following new observations. (1) Older-first observation: because the object survival function is non-increasing, the longer you wait the more likely an object is garbage. (2) Copy reserve observation: we can reduce the copy reserve through incremental collection. (3) Connectivity observation: connected objects tend to die together. (4) Mutability observation: old objects are mutated much less frequently than young ones. We have used these insights to drive the invention of a wide variety of new garbage collection algorithms, a new family of collectors, and optimizations, including Older-First, Beltway, Connectivity GC, Mark-Copy, GenRC, and Sapphire which we now describe.
We build on our previous study, based on garbage collection simulation, that points to potential improvements by using an Older-First copying garbage collection algorithm[SHB+02,SMM99]. The Older-First algorithm sweeps a fixed-sized window through the heap from older to younger objects, and avoids copying the very youngest objects, which have not had sufficient time to die.
The operation of Older-First is illustrated in Figure 2. The A and C label regions that are targets for Allocation and Copying, respectively. Allocation proceeds at the right (young) end of the A region. Garbage collection copies reachable objects from a window-full at the left (old) end of the A region to the right (young) end of the C region. Allocation then continues until the total space allocated has reached the allowed heap size, and one collects another window-full. Eventually, the collector will reach the end of the A region; when it does, it reverses the roles of A and C, giving it a full A region from which to collect, and an empty C region into which to copy.
We developed and examined an implementation of the Older-First algorithm in the Jikes RVM [SHB+02]. This investigation showed that Older-First can perform as well as the simulation results suggested, and that it greatly improves total program performance when compared to using a fixed-size nursery generational collector. We further compared Older-First to a flexible-size nursery generational collector, in which the nursery occupies all of the heap that does not contain older objects. In these comparisons, the flexible-nursery collector is occasionally the better of the two, but on average the Older-First collector performs the best.
In 2002, we combined the older-first observation with generational principals to create a new framework for copying collection that generalizes, extends, and subsumes all previous work on copying collection [BJMM02]. We call this new family of collectors Beltway, and using Beltway configurations designed new GC algorithms that significantly outperform previous work.
We designed and implemented a new garbage collection framework that significantly generalizes existing copying collectors. The Beltway framework exploits and separates object age and incrementality. It groups objects in one or more increments on queues called belts, collects belts independently, and collects increments on a belt in first-in-first-out order, as shown in Figure 3. We show that Beltway configurations, selected by command line options, act and perform the same as semi-space, generational, and older-first collectors, and thus Beltway encompasses all previous copying collectors of which we are aware.
What Figure 3 illustrates is a configuration with three belts, numbered 0, 1, and 2. In each belt, there are a number of increments, with the oldest one shown to the left and the youngest to the right. Allocation proceeds in the youngest increment of Belt 0, until that belt's space limit (or the overall heap size limit) is reached. Then Beltway collects an oldest increment from the youngest belt that has a full increment. Survivors go to the youngest increment of the next higher belt (in the case of the highest belt, to the end of that same belt). The figure illustrates three successive collections. By controlling the number of belts and their individual space limits, we obtain the wide variety of copying collectors claimed.
The increasing reliance on garbage collected languages such as Java requires that the collector perform well. We show that the generality of Beltway enables us to design and implement new collectors that are robust to variations in heap size and improve total execution time over the best generational copying collectors of which we are aware by up to 40%, and on average by 5 to 10%, for small to moderate heap sizes. New garbage collection algorithms are rare, and yet we define not just one, but a new family of collectors that subsumes previous work. This generality enables us to explore a larger design space and build better collectors. Our future GC work will further explore and exploit this framework.
Partitioning by age works well for many programs because younger objects usually have short lifetimes and thus garbage collection of young objects is often able to free up many objects. However, generational garbage collectors are typically much less efficient for longer-lived objects, and thus prior work has proposed many enhancements to generational collection. We explored whether the connectivity of objects can yield useful partitions or improve existing partitioning schemes [HDH01]. We looked at both direct (e.g., object A points to object B) and transitive (e.g., object A is reachable from object B) connectivity. Our results indicate that connectivity correlates strongly with object lifetimes and death times and is therefore likely to be useful for partitioning objects.
We used our connectivity information to design and implement a new family of garbage collectors, Connectivity-Based Garbage Collection (CBGC). A key distinguishing feature of CBGC is that rather than partitioning objects by age (as is done by generational and other age-based collectors), CBGC partitions objects based on their connectivity. If an object in one partition may point to an object in another partition, CBGC maintains an edge between the partition. This conservative connectivity information has many uses: (i) CBGC can perform partial collection without write barriers; (ii) CBGC can collect independent partitions in parallel with minimal synchronization; (iii) CBGC never requires a full heap collection and thus can have a smaller memory footprint.
We explored a wide range of alternatives for CBGC using a detailed garbage collector simulator that we developed for this study. This study directly built upon the Merlin garbage collector simulator developed (see Section 1.1). Our study indicates that even a simplistic CBGC can outperform generational garbage collector for many important metrics (e.g., memory footprint). We are now using the results of this study to guide the implementation of CBGC in the Jikes RVM [HDH03].
Copying garbage collectors have several advantages over non-copying collectors, including cheap allocation and prevention of fragmentation. However, in order to provide completeness (the guarantee to reclaim each garbage object eventually), standard copying collectors require space equal to twice the size of the maximum live data for a program. Although Beltway mitigates this cost when configured with small increments, it can incur this space overhead when it guarantees completeness.
We recently developed the Mark-Copy collection algorithm which extends generational copying collection and significantly reduces the heap space required to run a program while still guaranteeing completeness [SM03]. The new collector runs in space that is nearly a factor of two smaller than that required by standard copying garbage collectors, increasing the range of applications that can use copying garbage collection. We find that when the collector is given the same amount of space as a generational copying collector, it improves program execution time of Java benchmarks by 20-85% in tight heaps and by 5-10% in moderate size heaps.
We also compare the performance of the mark-copy collector with that of a mark-sweep collector. We find that, for several benchmarks, the mark-copy collector can run in heaps comparable in size to the minimum heap space required by the mark-sweep collector. We also find that for some benchmarks the mark-copy collector is significantly faster than a mark-sweep collector in tight heaps. This work is submitted for publication .
General purpose garbage collectors have yet to combine short pause times with fast throughput. For example, non-concurrent generational collectors can achieve high throughput and have modest average pause times, but occasionally collect the whole heap and consequently incur long pauses. On the other extreme, concurrent collectors, including reference counters, attain short pause times with significant performance penalties. We develop a new collector GenRC, which combines copying generational collection and reference counting the older generation to achieve both goals in a uniprocessor setting [BM02]. Key to our algorithm is a generalization of deferred reference counting which allows GenRC to safely ignore mutations to nursery objects. GenRC thus restricts copying and reference counting to the object demographics for which they perform well. The contiguous allocation of a copying collector is extremely fast, and collection time is proportional to the live objects whose survival rates are usually low in a fixed size nursery. Reference counting time is proportional to object mutations and the number of dead objects, both of which are typically low for older objects. We further restrict the time spent reference counting with collection triggers and buffering. We compare a fixed nursery variant of GenRC (FG-RC) with pure reference counting (RC), high performance copying and mark-sweep fixed nursery generational (FG-MS), and pure mark-sweep collectors (MS). We show that FG-RC is competitive with FG-MS on throughput, and attains much lower average and maximum pause times.
The growing use of concurrent systems built in languages that require garbage collection (GC), such as Java, is raising practical interest in concurrent GC. Sapphire is a new algorithm for concurrent copying GC for Java [HM01,HM03]. It stresses minimizing the amount of time any given application thread may need to block to support the collector. In particular, Sapphire is intended to work well in the presence of a large number of application threads, on small to medium scale shared memory multiprocessors.
Sapphire extends previous concurrent copying algorithms, and is most closely related to replicating copying collection, a GC technique in which application threads observe and update primarily the old copies of objects [NOPH92]. The key innovations of Sapphire are: (1) the ability to ``flip'' one thread at a time (changing the thread's view from the old copies of objects to the new copies), as opposed to needing to stop all threads and flip them at the same time; (2) exploiting Java semantics and assuming any data races occur on volatile fields, to avoid a barrier on reads of non-volatile fields; and (3) working in concert with virtually any underlying (non-concurrent) copying collection algorithm.
We also investigated memory management for C and C++ programs, and in the future we hope to incorporate ideas from this work into our Java memory managers [Ber02,BMBW00,BZM01,BZM02].
Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of 4#4 (the number of processors) to unbounded memory consumption.
We introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient [Ber02,BMBW00]. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.
We developed the Heap Layers memory management system for C and C++ programs that combines composable heaps with high performance [BZM01]. We used this infrastructure to perform a number of detailed explicit memory management studies. For example, many programmers use custom memory allocators hoping to achieve performance improvements. We performed the first in-depth study of eight C and C++ applications that use custom allocators [BZM02]. Surprisingly, for six of these applications, the Lea allocator, a state-of-the-art general-purpose allocator performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (up to 44% faster). Regions use bump-pointer allocation, and en-masse freeing. Regions thus reduce programmer burden and help eliminate a source of memory leaks. We show, however, that the inability of programmers to free individual objects within regions can lead to a substantial increase in memory consumption (up to 230% more).
To eliminate this excessive memory consumption, we develop the Reap extended general-purpose memory allocator, which combines region semantics with individual object deletion. Reap outperforms similar allocators and comes close to matching region performance while providing the potential for reduced memory consumption. We thus demonstrate an implementation of an extended memory allocation interface that eliminates the need for most custom memory allocators and the programmer effort needed to build and maintain them.
Acknowledgments and Disclaimer
This work is supported by NSF ITR CCR-0085792 (this grant). In addition, this work is synergistic with the following grants and support from NSF ACI-9982028, NSF EIA-9726401, NSF CDA-9502639, NSF Career CCR-0133457, NSF IIS-9988637, an IBM faculty partnership award, DARPA F30602-98-1-0101, DARPA F33615-01-C-1892, DARPA F33615-01-C-1894, IBM, Microsoft, and Compaq. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
For questions or comments contact hoffmann@cs.umass.edu.Copyright 2001-2003 by the DaCapo
Project,
All Rights Reserved.