Abstract
Android systems typically run on resource-constrained handheld devices, so using the Java heap efficiently is one of developers’ main concerns. Developers often use profilers to observe how efficiently Java objects are used, hoping to find memory allocation bottlenecks and to identify and solve problems such as memory leaks. However, Android and its Java virtual machines currently lack a low-overhead, efficient Java object profiler.
In this paper, we design and implement a novel, low-overhead Java object profiler based on the Address-Chain technique, on Android 6.0 and its ART virtual machine, which uses an AOT (ahead-of-time) compiler and has complex garbage collection algorithms. For every Java object, our profiler records the allocation site, the class information, the object size, the birth and death times, the physical memory trace of object movements with time stamps, the last access time and the access pattern. The profiled data can help developers detect memory leaks, implement optimizations such as pretenuring and tune the performance of the garbage collector.
The Java object profiling mechanism proposed in this paper has low execution time overhead, imposes no overhead on the Java heap and does not modify any existing key data structure of the ART virtual machine, including the object layouts and class layouts. By caching object access events in a global register and removing redundant instrumentation, on a Nexus 7 running Android 6.0, the read/write barrier overhead of the profiler is about 19% on average for EEMBC, SciMark and other workloads. The I/O overhead is about 28% and the total execution overhead is about 51% on average.
Introduction
In recent years, embedded devices have become increasingly popular. Mobile phones and tablets, many of which run Android, are the most popular electronic devices. The latest versions of Android use the ART virtual machine to execute Java programs and Android applications. Managed languages such as Java, C#, Python and PHP rely on garbage collection to manage all objects at runtime, which (1) reduces memory-related failures such as dangling pointers, double frees and buffer overflows, (2) improves the efficiency of heap usage and (3) eliminates memory leaks due to unreachable objects. However, object-oriented Android applications still suffer from memory problems that are difficult to locate. For example, developers may keep references to objects they no longer use, or create and destroy a large number of objects of the same type. Resources in embedded systems are limited, and developers need to allocate them carefully; memory is one of the scarcest. Problems in one application can degrade the performance of the whole system, making the device run less smoothly and draining the battery faster, which hurts software reliability and user experience. In some cases, we need to know the lifetime of every object to find ways to optimize a program.
Some researchers have tried to profile the memory behavior of programs on the ART virtual machine [11, 40]. Developers can also use HPROF to generate a snapshot of the Java heap for Android applications [17]. However, these tools cannot provide life-cycle and access information for Java objects. Other researchers have implemented tools that profile the Android system and its applications with respect to I/O overhead [23], privacy leaks [1, 18], application behavior [39], resource leaks [42, 55], energy consumption [12], etc. There are also many studies and tools on other platforms that track information about Java objects [5, 53], but they are not designed for the ART virtual machine and are not available on Android, and they sometimes incur high space and time overhead. Therefore, profiling and optimizing programs on Android remains a challenge.
In this paper, we present a low-overhead and efficient Java object profiler on the ART virtual machine using an extended Address-Chain profiling method [38]. Object events (e.g., object allocation, access, movement and death) are tracked online and recorded into files. Unlike other Java virtual machines, the ART virtual machine uses AOT (ahead-of-time) compilers. Exploiting this feature, we encode each object access event as a single bit in a bitmap that maps the heap, and we add instrumentation at object uses at compile time, before an Android program first runs. By caching the accessed object in a global register and removing fully redundant instrumentation at object uses, the read/write barrier overhead of our profiler is about 19% on average for EEMBC, SciMark and other workloads, which is reasonable. Our redundancy elimination method is similar to that of Bell [5] but is more effective. We then use an offline analyzer to construct the Address-Chain of every object, in which each object event is paired with a time stamp (the number of times the garbage collector has been invoked). From the Address-Chain of an object o we can obtain all the events that happened to o. The profiled result provides object information such as object lifetime, last access time and access frequency, which is valuable for finding memory leaks [5–7, 54] in Java programs and for many runtime optimization techniques, such as object tenuring [3, 37], reuse [48, 49] and compression [24, 35].
In summary, our contributions are as follows. We propose a low-overhead and effective Java object profiler on the ART virtual machine. It imposes no overhead on the Java heap and does not modify the object layouts, class layouts or any other key data structure of the ART virtual machine. Our profiler records the accurate life cycle, the allocation site, the access events and the physical memory trace of object movements with time stamps for every Java object. From this information we can find memory leaks, optimize programs and tune garbage collectors. We have implemented our profiler in two parts: an online object event tracer on the ART virtual machine in Android 6.0 and an offline object event analyzer. On a Nexus 7, the read/write barrier overhead of the profiler is about 19% on average for EEMBC, SciMark 2.0, CaffeineMark 3.0 and DeltaBlue [10, 36]. The I/O overhead is about 28% and the total execution overhead is about 51% on average.
Related work
Profilers. Commercial profilers [19, 53] usually provide general runtime information about programs and visualize Java heap objects of different types. However, these commonly used Java profilers often disagree on profiled results [27] and cannot tell us how a memory leak happens. Profilers proposed by researchers vary in their abilities and goals [16, 38]. Merlin [16] uses collected object timestamps to compute object lifetimes efficiently. Elephant Tracks (ET) [33, 34] builds on the Merlin algorithm to produce a comprehensive program trace that helps prototype new GC algorithms and conduct program analysis. OEP [24] records all objects in a particular execution and partitions them into equivalence classes offline. It finds objects within an equivalence class that can be merged into one without changing the behaviour of the traced program run, and optimizes the program by replacing equivalent objects with a single object. Odaira et al. [31] profile object access events and use the feedback to optimize the usage of certain array objects. Merlin, ET and OEP impose high runtime overhead, while Address-Chain [38, 54] has a much lower overhead. We extend the Address-Chain method by adding access events so that it records the whole lifetime of objects.
The tools above do not support the Android platform. Programmers can use HPROF [17] to get a snapshot of the whole heap of a running program in Android. Su et al. [40] design a framework for the Dalvik virtual machine to analyze the Android runtime system. They use this framework to identify whether a bottleneck occurs at the application level, the Linux user-space level or the Linux kernel level, but not to profile Java objects. Chang et al. [11] implement a profiler that monitors the memory behavior of Android applications. None of these Android tools gives the lifetime of objects. Our tool is the first attempt to track object lifetimes in the ART virtual machine, and it has reasonable overhead.
Barriers and instrumentation optimization. Barriers are necessary to obtain the access events of objects [31]. Prior work shows that lightweight barriers can be cheap (5 to 8% overhead on average), but more complex barriers are expensive (15 to 20% on average) [2, 56]. When bits in an object are stolen to record access events, atomic instructions are often needed to guarantee thread safety. On ARM, the ldrex and strex instructions are too expensive: they caused over 400% overhead when we tried to steal bits. We therefore use a bitmap to record object access information. Bacon et al. use common subexpression elimination to remove fully redundant read barriers, which reduces the average overhead from 6 to 4% [2]. Bell uses data-flow analysis to remove fully redundant instrumentation [5], which reduces the average overhead from 29 to 22%. Since our barriers include more instructions, we implement much more aggressive redundancy elimination: we cache the accessed object in a global register and remove fully redundant instrumentation at object uses, using data-flow analysis results based on SSA form.
Static analysis. Static program analysis analyzes software without actually executing it and therefore has no runtime overhead. Relda2 [42] performs inter-procedural analysis to find resource leaks in Android programs. Thresher [4] uses path-sensitive static analysis to reason about whether an object can be reached from another variable or object via pointer dereferences; this heap reachability information is used to detect Android memory leaks. LeakChecker [51] uses static analysis to identify leaked objects that are created inside a loop while being referenced by objects outside the loop, but it requires developers to provide suspicious loops manually. Yang et al. [52] develop a control-flow representation of user-driven callback behavior, using context-sensitive analysis of event handlers. They also develop a client analysis that builds a static model of the application’s GUI. However, static analysis often reports false positives, since it must make conservative assumptions about control flow, and some information, such as the number of objects and dynamically loaded classes, may not be available at analysis time. Sometimes we must analyze programs dynamically.
Object’s lifetime and last access time. Knowing an object’s lifetime can help us find performance problems. For example, Xu [48] optimizes programs by reusing data structures that have disjoint lifetimes. Resurrector [49] uses a modified reference counting strategy, which is much faster than Merlin, to profile object lifetimes and find reusable data structures. The last access time of objects is crucial for many memory leak detectors that employ dynamic analysis [5–7, 54]. Melt and LeakSurvivor [6, 41] identify stale objects and transfer them to disk, then bring them back when the program needs them. Sleigh and leak pruning [5, 7] use two-bit and three-bit stale counters, respectively, to detect stale objects. When memory is used up, Sleigh reports the last use sites of leaked objects, while leak pruning prunes them. Container profiling [43] concentrates on leaks in containers. When calculating leak confidence, Container profiling takes the container’s size into consideration, while LeakTracer [54] takes every object’s size into consideration. LeakTracer needs much less time than the others to find leaks (usually dozens of minutes). Although these tools take different approaches, they all rely on the staleness (the last access time) of objects.
Bloat analysis. Software bloat analysis [9, 50] addresses a more general problem: finding, removing and preventing performance problems. For example, PerfBlower [15] provides a specification language, ISL, to describe general performance problems that have observable symptoms, together with an automated test oracle based on virtual amplification. Developers can use this performance testing framework to test Java programs for memory-related performance problems. The techniques for finding memory leaks described above can also be considered bloat analysis.
Profiling object information
This section first gives a brief introduction to the Address-Chain profiling method, then describes how to record object access events using a bitmap. Additional information, such as object size and type, is recorded to help us profile objects more precisely. Section 3.3 shows how to extend the Address-Chain profiling method to do this.
Address-chain technique
We give a brief description of Address-Chain here to support the discussion; more details can be found in [38].
An Address-Chain of an object o is a vector of pairs of the following form:
$$\langle (AllocS, T_d),\ (T_0, Addr_0),\ (T_1, Addr_1),\ \ldots,\ (T_i, Addr_i),\ \ldots \rangle$$
Here $T_i$ stands for the number of times the garbage collector has been invoked. $(AllocS, T_d)$ means that the allocation site of o was $AllocS$ and that o was dead at time $T_d$, and $(T_0, Addr_0)$ means that o was created at time $T_0$ with the initial physical address $Addr_0$. $(T_i, Addr_i)$ means that the garbage collector moved o to address $Addr_i$ at time $T_i$. Since objects are not moved when the virtual machine uses a mark-sweep garbage collector, an Address-Chain contains only one address pair, $(T_0, Addr_0)$, in that scenario. With compacting, copying and generational garbage collectors, objects can be moved among different areas. Whenever an object is moved, a new pair $(T_i, Addr_i)$ is appended to its vector. Whether or not the garbage collector moves objects, an Address-Chain uniquely identifies an object for a particular program execution. For example, 〈(0x6DA342B0, 8), (1, 0x130000D0), (4, 0x12CD73C0)〉 represents an object that was allocated at memory address 0x130000D0 from the allocation site 0x6DA342B0 in the first GC cycle (i.e., between the first GC and the second GC) and was then moved to address 0x12CD73C0 during the fourth GC. Finally, this object was dead after the eighth GC.
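The structure of an Address-Chain can be illustrated with a minimal Python sketch. The class and field names below are illustrative assumptions rather than the profiler's actual data structures, and the sketch assumes that a move recorded at time T_i takes effect from T_i onward.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AddressChain:
    """One object's Address-Chain (hypothetical representation)."""
    alloc_site: int               # AllocS: address of the allocation site
    death_time: int               # T_d: GC count at which the object was dead
    moves: List[Tuple[int, int]]  # [(T_0, Addr_0), (T_1, Addr_1), ...]

    @property
    def birth_time(self) -> int:
        # T_0 is the time stamp of the first (time, address) pair.
        return self.moves[0][0]

    def lifetime_in_gcs(self) -> int:
        # Lifetime measured in GC counts, the time unit of the Address-Chain.
        return self.death_time - self.birth_time

    def address_at(self, t: int) -> int:
        """Return the address of the object at GC time t (t >= birth time)."""
        if t < self.birth_time:
            raise ValueError("object has not been allocated yet")
        addr = self.moves[0][1]
        for move_time, new_addr in self.moves:
            if move_time <= t:
                addr = new_addr
        return addr


# The example chain from the text.
chain = AddressChain(0x6DA342B0, 8, [(1, 0x130000D0), (4, 0x12CD73C0)])
print(chain.lifetime_in_gcs(), hex(chain.address_at(3)), hex(chain.address_at(4)))
```

In this representation the lifetime is simply the difference between the death and birth time stamps, so its resolution is one GC cycle.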
Encoding object access event
Some previous work steals bits in the object header or changes the object layout to record object information such as access events and allocation sites [5, 54]. Programs on ART have limited resources, and adding fields to objects could introduce significant overhead to the garbage collectors and the ART virtual machine. Stealing bits in objects often requires atomic instructions to guarantee thread safety. We tried to steal bits in the object’s hash code using ldrex and strex, but this sometimes adds huge overhead: EEMBC ran 400% slower in our tests.
In this paper, we instead track object access events with a bitmap that maps the Java heap. As shown in Fig. 1, every k bytes allocated by the Java virtual machine correspond to one bit in the bitmap, where k is the object alignment. When an object o is accessed, we set the corresponding bit to 1. When a GC occurs, the bitmap is dumped to a file and all its bits are reset to 0. We may lose some access information only when objects mapped to the same bitmap byte are accessed by multiple threads at the same time. The address of the bit corresponding to object o is calculated from the address of o, the start address of the heap and k.

Fig. 1. Bitmap mapping.
There is no need to scan every object in the Java heap: we can simply dump the bitmap and deduce object access information from it. Our recording method adds no space overhead to the Java heap and does not modify the object layout, which makes it easier to port to other platforms.
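The bitmap mapping can be sketched as follows. This is a simplified single-threaded model, not the instrumentation emitted by the compiler; the heap start address 0x12C00000, the 64 MB heap size and all names are assumptions chosen for illustration, and the bit position is taken to be the object's offset from the heap start divided by the alignment k.

```python
class AccessBitmap:
    """One bit per `alignment` bytes of heap (hypothetical model)."""

    def __init__(self, heap_begin: int, heap_size: int, alignment: int = 8):
        self.heap_begin = heap_begin
        self.alignment = alignment
        n_bits = heap_size // alignment
        self.bits = bytearray((n_bits + 7) // 8)

    def _locate(self, addr: int):
        # Bit index = offset of the object in the heap divided by k.
        index = (addr - self.heap_begin) // self.alignment
        return index >> 3, 1 << (index & 7)

    def mark(self, addr: int) -> None:
        byte, mask = self._locate(addr)
        self.bits[byte] |= mask

    def is_marked(self, addr: int) -> bool:
        byte, mask = self._locate(addr)
        return bool(self.bits[byte] & mask)

    def dump_and_clear(self) -> bytes:
        # At a GC, the bitmap is written out and every bit is reset to 0.
        snapshot = bytes(self.bits)
        self.bits = bytearray(len(self.bits))
        return snapshot


bitmap = AccessBitmap(heap_begin=0x12C00000, heap_size=64 * 1024 * 1024)
bitmap_size = len(bitmap.bits)          # 1/64 of the heap size
bitmap.mark(0x130000D0)
marked_before = bitmap.is_marked(0x130000D0)
neighbour_marked = bitmap.is_marked(0x130000D8)
snapshot = bitmap.dump_and_clear()
marked_after = bitmap.is_marked(0x130000D0)
```

Because only the bitmap is scanned at a GC, the cost of collecting access information in this model depends on the heap size and not on the number of objects.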
To record more information about objects and access events, we extend the Address-Chain to a two-dimensional vector as
Type and Size indicate the object’s type and size. $T_{ai}$ means that the object is accessed at time $T_{ai}$; the last access time of an object can be deduced from its access time sequence. The Address-Chain of an object does not exist during program execution; our offline analyzer builds it from the recorded object events. When an object o is allocated, we record its allocation site, physical address and type into a file. If o is a non-array object, its size is implied by its type. If o is an array object, we additionally record its size, because arrays of the same type can have different lengths. Object access events are computed from the dumped bitmap. The time stamp of each event can be deduced from the GC start and GC end events, which we do not present here.
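How an offline analyzer might assemble extended Address-Chains from recorded events can be sketched as below. The event tuple format, in which every record already carries its GC time stamp, is a hypothetical simplification and not the profiler's actual file format; the object size of 16 bytes is also illustrative.

```python
def build_chains(events):
    """Build extended Address-Chains from a time-ordered event list."""
    live = {}        # current address -> chain under construction
    finished = []    # chains of objects that have died
    for ev in events:
        kind, t = ev[0], ev[1]
        if kind == "alloc":
            _, _, site, addr, type_name, size = ev
            live[addr] = {"alloc_site": site, "type": type_name, "size": size,
                          "moves": [(t, addr)], "accesses": [], "death": None}
        elif kind == "access":
            chain = live.get(ev[2])
            # One access record per GC interval is enough.
            if chain is not None and t not in chain["accesses"]:
                chain["accesses"].append(t)
        elif kind == "move":
            _, _, old_addr, new_addr = ev
            chain = live.pop(old_addr)
            chain["moves"].append((t, new_addr))
            live[new_addr] = chain
        elif kind == "dead":
            chain = live.pop(ev[2])
            chain["death"] = t
            finished.append(chain)
        else:
            raise ValueError("unknown event kind: %r" % (kind,))
    return finished, live


events = [
    ("alloc", 1, 0x6DA342B0, 0x130000D0, "com.sun.mep.bench.Chess.Point", 16),
    ("access", 1, 0x130000D0),
    ("access", 3, 0x130000D0),
    ("move", 4, 0x130000D0, 0x12CD73C0),
    ("access", 5, 0x12CD73C0),
    ("dead", 8, 0x12CD73C0),
]
finished, still_live = build_chains(events)
last_access = finished[0]["accesses"][-1]
```

Keying the table of live objects by current address lets a freed address be reused by a later allocation without confusing the two objects.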
Implementation
Online object events tracker
We implement the object event tracker on top of the ART virtual machine in Android 6.0, as shown in Fig. 2. We instrument ART’s concurrent mark-sweep garbage collector and semi-space garbage collector to track object allocation, movement and death events, and we instrument the Quick ahead-of-time compiler to acquire object access events.

Fig. 2. Framework of the profiler.
1) Garbage Collector Instrumentation
The ART virtual machine splits the Java heap into ImageSpace and AllocSpace; Java objects created by users are allocated in AllocSpace. AllocSpace is further split into ZygoteSpace, MainAllocSpace and LargeObjectSpace. MainAllocSpace can be a MallocSpace (used with the concurrent mark-sweep GC and the mark-compact GC), a BumpPointerSpace (used with the semi-space GC) or a RegionSpace (used with the concurrent-copying GC). Android programs create objects mainly in MainAllocSpace, where the object alignment is 8 bytes. Primitive arrays and strings larger than 3 pages can be allocated in LargeObjectSpace, which is often used to store resources such as images. CMS (the concurrent mark-sweep garbage collector) and the semi-space garbage collector are the most frequently used garbage collectors in the ART virtual machine, and they are the only ones we have instrumented so far. We plan to support lifetime tracking for the mark-compact and concurrent-copying garbage collectors as well. We track object allocation, movement and death events for AllocSpace and access events for MainAllocSpace. In other words, we only track information about objects allocated by users.
CMS uses a live bitmap to track live objects and collects objects at three levels: full, partial and sticky. Since objects are not moved under CMS, we only need to record object allocation and death events. For the semi-space garbage collector, it is easy to record object allocation and move events. All objects that are not moved during a collection are dead, so we do not need to record dead objects explicitly: we record the boundaries of the BumpPointerSpace and identify dead objects in the offline analyzer. When an object is allocated, execution jumps from compiled code to stub code, so we simply record the value of LR (the link register) as the allocation site.
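The offline identification of dead objects under the semi-space collector can be sketched as a set difference. The function name, the space boundaries and the addresses are illustrative assumptions; the sketch only encodes the rule that an object inside the recorded space boundaries that was not moved during the collection is dead.

```python
def find_dead_objects(allocated, moved_from, space_begin, space_end):
    """Return addresses in [space_begin, space_end) that were not moved."""
    return sorted(addr for addr in allocated
                  if space_begin <= addr < space_end and addr not in moved_from)


# Hypothetical state before one semi-space collection.
allocated = {0x13000000, 0x13000010, 0x13000020}
moved_from = {0x13000010}   # only this object was copied to the other space
dead = find_dead_objects(allocated, moved_from, 0x13000000, 0x13001000)
```

With this rule the online tracker never has to write a death record for the semi-space collector, which keeps the amount of dumped data smaller.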
2) AOT Compiler Instrumentation
The ART virtual machine uses AOT compilers to compile Java programs before running them. We instrument the Quick compiler, so the instrumentation at object access points is inserted before a program actually runs. If an object is read or written, we treat it as accessed. We use a bitmap to record object access events. Before every GC finishes, we dump the bitmap to a single file and clear all its bits. Similar to [24, 54], we add read or write barriers at the following object access points: field read/write; array element read/write; array length read; method invocation; lock acquisition; and reference comparison.
While running, the ART virtual machine may switch the type of garbage collector, so when recording a GC mark we also record the GC type into the file. CMS is a concurrent garbage collector, so objects can be accessed even while the garbage collector is running. We therefore dump the access bitmap when the GC finishes, in order to capture the complete set of object access events. As a consequence, we may record access events for dead objects. With the semi-space garbage collector, dumping the bitmap when the GC finishes records access events for both dead objects and moved objects. Since the access bitmap is dumped into a single file, we avoid these problems by processing the access events of a GC cycle before computing the object movement and death information of that cycle.
Optimization
We add instrumentation at object uses (object reads and writes), consisting of several load, store and bit-manipulation instructions. This instrumentation executes frequently and can be costly: not only do programs run slower, but the AOT-compiled output file also becomes larger. We use caching and redundant instrumentation removal to reduce the execution overhead and the size of the compiled result.
Basic idea
A last access time measured in GC counts is precise enough to find memory leaks and gives good performance [54], so our profiler tracks access events at the granularity of GC counts. Within one GC interval, whether an object is accessed many times or just once, the recorded bit is the same 1 and only indicates that the object was accessed during this interval. In the pseudo code shown in Fig. 3 (a), a and b refer to the same object, and all three object access events set the same bit to 1. We therefore eliminate the recording of redundant access events as much as possible: if an object has already been accessed, subsequent accesses to it need not be recorded.

Fig. 3. (a) Object access pseudo code. (b) Caching object access events in a global register. (c) Algorithm for removing redundant instrumentation while compiling code in a basic block.
If an object o is accessed, then o is likely to be the next accessed object as well. We therefore use a global register to cache the address of the previously accessed object. RISC architectures such as ARM have plenty of general-purpose registers. We modify the Quick compiler so that it does not allocate the r8 register, which lets us use r8 to cache the address of the accessed object. As shown in Fig. 3 (b), if o is the previously accessed object, r8 holds the address of o, and accessing o again does not set the corresponding bit. Otherwise, if r8 is not equal to the address of o, r8 is set to the address of the newly accessed object and the access event is recorded in the corresponding bit.
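The effect of the register cache can be modelled with a short simulation. The variable `cached` stands in for r8, and the function name, heap start address and access sequences are assumptions made for illustration; the real barrier is a few machine instructions emitted by the compiler, not Python.

```python
def record_accesses(addresses, heap_begin, alignment=8):
    """Simulate barrier execution with a one-entry cache of the last object."""
    cached = None          # plays the role of the r8 register
    marked_bits = set()    # indices of bits set in the access bitmap
    bitmap_updates = 0     # how often the bitmap was actually written
    for addr in addresses:
        if addr == cached:
            continue       # same object as the previous access: skip the store
        cached = addr
        marked_bits.add((addr - heap_begin) // alignment)
        bitmap_updates += 1
    return marked_bits, bitmap_updates


HEAP_BEGIN = 0x12C00000    # hypothetical heap start address
o, p = 0x130000D0, 0x130000E0

# Three accesses to one object touch the bitmap once.
bits_same, updates_same = record_accesses([o, o, o], HEAP_BEGIN)
# Alternating objects defeat the cache but still mark only two bits.
bits_mixed, updates_mixed = record_accesses([o, o, p, o], HEAP_BEGIN)
```

The cache never changes which bits end up set; it only reduces how often the bitmap is written.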
Removing redundant instrumentation
The ART virtual machine is register-based, and frames are fixed in size upon creation. Each frame consists of a particular number of registers (specified by the method). The registers used by Dalvik bytecode are virtual registers. If a virtual register r contains an object o and o is accessed through r, we say that r is accessed. If, in the subsequent code, r still contains o and r is accessed again, we treat the second access event as redundant and do not insert any instrumentation. We add a bit vector to every basic block to record which virtual registers have been accessed and not reassigned since. The length of the bit vector is the number of virtual registers of the method. The bit corresponding to a virtual register indicates whether instrumentation can be removed at the next access to this virtual register in this basic block.
The manipulations of a virtual register can be divided as follows:
Fig. 4 (a) illustrates how our method handles v5 (virtual register 5) within a basic block. Here a(r) denotes an access to virtual register r and c(r) denotes a change of the value of r; R means that the access event is recorded and O means that recording is omitted. For the first a(v5), we set the corresponding bit from 0 to 1 and insert instrumentation. For the second a(v5), the corresponding bit is 1, so we do not need to insert instrumentation. Then c(v5) resets the corresponding bit from 1 to 0. For the final a(v5), the corresponding bit is 0, so we insert instrumentation and set the bit from 0 to 1 again.
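The per-block decision for the v5 example can be sketched as follows. This is a simplified model of the bit-vector bookkeeping described above, not the Quick compiler's code; the operation encoding ("a" for access, "c" for change) and the function name are assumptions.

```python
def instrument_block(ops, num_vregs, initial_bits=0):
    """Decide for each operation whether instrumentation is inserted.

    ops is a list of (op, vreg) pairs: "a" accesses vreg, "c" changes it.
    Returns ("R" = record, "O" = omit, "-" = not an access) per operation
    and the bit vector at the end of the block.
    """
    bits = initial_bits
    decisions = []
    for op, vreg in ops:
        if not 0 <= vreg < num_vregs:
            raise ValueError("virtual register out of range")
        mask = 1 << vreg
        if op == "a":
            if bits & mask:
                decisions.append("O")   # already recorded, still the same value
            else:
                decisions.append("R")   # first access since the last change
                bits |= mask
        elif op == "c":
            bits &= ~mask               # new value: the next access must record
            decisions.append("-")
        else:
            raise ValueError("unknown operation")
    return decisions, bits


# a(v5), a(v5), c(v5), a(v5) as in the walk-through of v5.
decisions, exit_bits = instrument_block(
    [("a", 5), ("a", 5), ("c", 5), ("a", 5)], num_vregs=16)
```

Only the second access is omitted in this sequence, because the change of v5 in between invalidates what was known about the register.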

Fig. 4. Removing redundant instrumentation.
If a virtual register is accessed in a dominator of a basic block and is not assigned another value before this basic block, we can set the corresponding bit to 1 before compiling the basic block. When compiling a method, the Quick compiler performs an SSA transformation to optimize the code and SSA renaming when generating assembly code for the virtual registers, so the data-flow information is readily available from the SSA form. In the Quick compiler, the dominators of a basic block are compiled before the block itself. When compiling a basic block, its bit vector is first initialized to the bit vector of its immediate dominator. Then the bits of all virtual registers that receive an assignment on some path other than the one from the immediate dominator are reset to 0. These virtual registers are exactly the ones that appear in φ nodes. After initializing the bit vector, we start compiling the basic block.
In Fig. 4 (b, c), the immediate dominator of basic block b is a, and b is the immediate dominator of c, d and e. In Fig. 4 (b), the access a(v5) in basic block a lets us remove the instrumentation for v5 in basic blocks b, c, d and e. In Fig. 4 (c), however, the assignment c(v5) in basic block c prevents us from removing the instrumentation in basic blocks b, c and e.
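The initialization of a block's bit vector from its immediate dominator can be sketched in a few lines. The function name and the example register numbers are assumptions; the sketch only encodes the stated rule that a block inherits the dominator's bits and clears the bits of virtual registers that appear in its φ nodes.

```python
def initial_bits_for_block(idom_exit_bits, phi_vregs):
    """Inherit the immediate dominator's bit vector, then clear phi registers."""
    bits = idom_exit_bits
    for vreg in phi_vregs:
        bits &= ~(1 << vreg)   # value may come from another path: re-record
    return bits


# Hypothetical dominator state: v2 and v5 were accessed and not reassigned.
idom_bits = (1 << 2) | (1 << 5)

# A block without phi nodes inherits everything.
inherited = initial_bits_for_block(idom_bits, phi_vregs=[])
# A block with a phi node for v5 must instrument the next access to v5.
after_phi = initial_bits_for_block(idom_bits, phi_vregs=[5])
```

Because dominators are compiled first, this initialization needs no iteration to a fixed point, which is what keeps the approach to a single pass.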
The algorithm for removing redundant instrumentation is shown in Fig. 3 (c). Our method is similar to Bell [5] and to partial redundancy elimination (PRE) analysis [8], but it is simpler because it uses the result of the SSA transformation. We do not add instrumentation at redundant uses because they need not be recorded. If a method runs for a long time, some in-use objects may be reported as stale by our profiler because of the removed instrumentation. However, this can only happen to an object that is referred to by a virtual register continuously from the instrumented use to the uninstrumented use. As with Bell, this does not cause inaccuracy in practice.
Methodology
Execution. In Android 6.0, the ART virtual machine uses both the Quick compiler and the Optimizing compiler to compile programs. Since we have not modified the Optimizing compiler, we configure the ART virtual machine to use only the Quick compiler. While programs are running, we use only the CMS garbage collector. We execute each benchmark with the minimum possible heap size for that benchmark. Every benchmark is run for 5 iterations and the results are averaged.
Platform. The Android source code we use is Marshmallow – 6.0.0_r1. The build target is aosp_flo-userdebug and the target build type is release. The host OS environment is linux-3.16.0-71-generic-X86_64-with-Ubuntu-14.04. We perform our experiments on a Nexus 7 (Wi-Fi), a quad-core device with an NVIDIA Tegra 3 (ARMv7) 1.6 GHz processor, 1 GB of RAM and 16 GB of ROM.
Benchmarks. We use the EEMBC benchmarks, the SciMark 2.0 benchmarks, the CaffeineMark 3.0 benchmarks and the DeltaBlue benchmark to evaluate the performance of our profiler. The execution configuration of each benchmark is given in Table 1.
Table 1. Configurations for benchmarks
Our profiler dumps all its data into files. The only additional runtime space it needs is the bitmap that tracks object access events (Section 3.2). On the ART virtual machine, the object alignment is 8 bytes, so the size of the bitmap is 1/64 of the corresponding Java heap. In this way, our profiling method neither adds any overhead to the Java heap nor changes any key data structure of the ART virtual machine. Table 1 presents the space overhead for the benchmarks used.
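As a worked check of the 1/64 ratio, assuming one bit per 8-byte aligned slot and 8 bits per byte (H, the heap size in bytes, is a symbol introduced here for illustration):

```latex
\text{bitmap size}
  = \frac{H}{8\ \text{bytes per bit}} \times \frac{1\ \text{byte}}{8\ \text{bits}}
  = \frac{H}{64}
```

Under these assumptions, a 64 MB heap would need a 1 MB bitmap.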
Unlike other Java virtual machines, the ART virtual machine uses AOT compilers, and all Android programs are compiled before they are executed. Compilation therefore adds no execution overhead, and none of the benchmarks we ran needs a warm-up. The algorithm described in Fig. 3 (c) inserts instrumentation in a single pass while compiling code, and all the information it needs is available from the SSA transformation. The compilation time overhead for our benchmarks is not significant. We add a bit vector to every basic block while it is being compiled; its length equals the number of virtual registers in the method. Most Java methods written by developers have fewer than 16 virtual registers when compiled to Dalvik bytecode, so the space overhead of compilation is very low.
Time overhead
Figure 5 shows the execution overhead added by our online tracker on each benchmark using the Quick compiler and the CMS garbage collector. The average overhead imposed on these benchmarks is about 51%. With our optimizations, the barriers at object accesses incur 19% overhead on average. Dumping the recorded events into files adds about 28% overhead. Chess, kxml and DeltaBlue have a much higher I/O overhead than the other benchmarks. One reason is that these benchmarks allocate more objects. Another is that we configure the minimum possible heap size, so GC occurs frequently and a large amount of object creation and death information is dumped into files. We do not use the thread-local profiling buffers described in the original Address-Chain technique [38]: using them would require modifying a large amount of code in the Java virtual machine, which is error-prone, and we would not be able to recover the order of events recorded in different threads.

Fig. 5. Runtime overheads of the profiler.
Figure 6 shows the barrier overheads at different optimization levels. None is the execution time without any optimization and has 61% average overhead. Caching only caches the address of the previously accessed object and has 50% average overhead. Elim_all is the execution time with redundant instrumentation removal and has 25% average overhead. All uses all optimizations. By caching object access events in a global register and removing redundant instrumentation, our profiler lowers the average barrier overhead from 61% to 19%, a saving of 42 percentage points.

Fig. 6. Overheads of barriers with or without optimizations.
Figure 7 presents the allocation and collection process in a 16 KB region of the heap of the chess benchmark. Black regions indicate that objects are allocated in these areas of the heap; gray regions indicate that the objects in these areas have been collected. Since we write object creation, death and movement information to files immediately, the offline analyzer can recover these events in the order in which they occurred. We can thus replay all object allocations, movements and collections in time order using their exact physical addresses in memory. This replay can help developers understand garbage collectors better and choose a suitable garbage collector for their programs. Garbage collection researchers can visualize how their algorithms work on real programs and improve them.

Fig. 7. Replay of the allocation and collection process. Black areas stand for allocated objects, gray areas stand for dead objects.
Execution information. Table 2 gives general information about each benchmark, such as the total number of allocated objects, the number of allocation sites and the number of GCs. In these benchmarks almost all objects are collected at the next GC, except in Scimark.large, where objects have a long lifetime. Topsite denotes the allocation site that allocates the most objects in each benchmark. Tables 3 and 4 show the top sites and the top classes of Chess. Chess is a program that simulates how chess pieces move. In Chess, more than half of the allocated objects have type com.sun.mep.bench.Chess.Point. A Point object stands for a point on the chessboard. As shown in Fig. 8, almost all Point objects are collected in every GC, so there may be an opportunity to reuse Point objects.
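Statistics such as top sites and top classes can be derived from the allocation records with a simple frequency count. The record format, site addresses, the second type name and the counts below are invented for illustration and are not measured data.

```python
from collections import Counter


def top_k(alloc_records, key, k=3):
    """Return the k most frequent values of `key` among allocation records."""
    counts = Counter(record[key] for record in alloc_records)
    return counts.most_common(k)


# Hypothetical allocation records: (allocation site, type name).
records = (
    [{"site": 0x6DA342B0, "type": "com.sun.mep.bench.Chess.Point"}] * 5
    + [{"site": 0x6DA34300, "type": "com.sun.mep.bench.Chess.Point"}] * 2
    + [{"site": 0x6DA35000, "type": "java.lang.String"}] * 3
)
top_sites = top_k(records, "site", k=2)
top_classes = top_k(records, "type", k=1)
```

A class that dominates the allocation count while being collected at almost every GC, as Point does in Chess, is a natural candidate for object reuse.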
Table 2. Profiled information of benchmarks
Table 3. Top sites of Chess
Table 4. Top classes of Chess

Fig. 8. Allocation and collection numbers of com.sun.mep.bench.Chess.Point in the execution.
Finding memory leaks. We use our profiler to find the memory leaks in two third-party examples, SwapLeak and ListLeak. These examples leak memory rapidly. We use the leak predictor of LeakTracer [54], which is based on staleness and object size. Using the last access time and size of objects, we successfully find the leaking types in these examples.
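To illustrate how staleness and size can be combined, the sketch below ranks types by the sum of staleness times size over their live objects. This scoring rule is an assumption made for illustration and is not LeakTracer's actual predictor [54]; the type names and numbers are invented.

```python
from collections import defaultdict


def rank_leak_suspects(live_objects, current_gc):
    """Rank types by accumulated (staleness * size); illustrative rule only."""
    scores = defaultdict(int)
    for obj in live_objects:
        staleness = current_gc - obj["last_access"]   # in GC counts
        scores[obj["type"]] += staleness * obj["size"]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical live objects at GC count 10.
live_objects = [
    {"type": "Node", "size": 24, "last_access": 2},
    {"type": "Node", "size": 24, "last_access": 2},
    {"type": "Node", "size": 24, "last_access": 2},
    {"type": "Buffer", "size": 256, "last_access": 9},
]
ranking = rank_leak_suspects(live_objects, current_gc=10)
```

In this example the small but long-unused Node objects outrank the large, recently used Buffer, which reflects the intuition that both staleness and size matter.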
In this paper, we proposed a low-overhead and efficient profiler for Java objects on the ART virtual machine, based on the Address-Chain profiling method. Our profiler reports per-object information such as the allocation site, last access time and physical memory trace. Our profiling method introduces no overhead on the Java heap and modifies no existing key data structure of the ART virtual machine or the Android system. Using global register caching and redundant instrumentation elimination, we reduce the barrier overhead by 42 percentage points, and the total execution overhead of our profiler is reasonable. The profiled result gives us sufficient information to find memory leaks and to support optimizations such as pretenuring and object reuse.
Acknowledgments
This work was supported by National Natural Science Foundation of China grant No. 61272166, the State Key Laboratory of Software Development Environment of China No. SKLSDE-2016ZX-08, and Huawei Research Fund No. HIRPO-20140405-YB2015080015.
