Async profiler with Andrei Pangin

Source video: https://www.youtube.com/watch?v=H6glyrKQlg8&list=PLNCLTEx3B8h4Yo_WvKWdLvI9mj1XpTKBr&index=1

Types of profiling

Tracing

  • Usually implemented via bytecode instrumentation
  • Slow, and therefore unsuitable for production

Sampling

  • Periodic snapshots of what the app is doing
  • Lighter weight than tracing and can be used in production
  • The sampling period will control the amount of overhead
  • Andrei uses it in production for his own app

Sampling in depth

  • In Java code: Thread.getAllStackTraces() returns a Map<Thread, StackTraceElement[]> (see the sketch below)
  • JVMTI also has a GetAllStackTraces method that returns some extra information
  • For 1000 threads:
    • 50 ms latency
    • ~10 MB result size
  • Advantage: it is simple; this is what VisualVM and YourKit use
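
A minimal sketch of such a sampler (illustrative only, not from the talk): take a periodic snapshot of all threads and record the top frame of each one.

import java.util.Map;

public class NaiveSampler {
    public static void main(String[] args) throws InterruptedException {
        // Each call stops all threads at a safepoint, which is exactly
        // the bias discussed below.
        while (true) {
            Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> e : traces.entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (stack.length > 0) {
                    // Count the top frame as one sample for this thread.
                    System.out.println(e.getKey().getName() + " -> " + stack[0]);
                }
            }
            Thread.sleep(100); // the sampling period controls the overhead
        }
    }
}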

Demo 1

  • VisualVM is safepoint based, so it shows Thread.currentThread().isAlive() as the place where all the time is spent.
  • He then tries the same app with JFR, and states that JMC showed nothing, as JFR only took one sample in 10 s.

Safepoint

  • The HotSpot JVM stops all threads to take a thread dump, but a thread can only stop at designated places in the code.
  • Safepoints are the points where the JVM is able to take a thread dump.
  • Safepoints are usually placed at:
    • Method exits
    • Loop back edges
  • A long, linear piece of code won't appear in the stack trace, because there is no safepoint inside it at which the thread could be captured.

Native methods

  • While a thread is executing a native method, the JVM has no idea whether it is sleeping or doing real work.
  • A busy client and an idle client will show up as taking equal amounts of time.

Solving GetAllStackTraces problems

  • Avoid safepoint bias
  • Skip idle threads
  • Profile native code correctly

AsyncGetCallTrace API

  • This is an internal HotSpot API for getting the stack trace of the current thread.
  • It is called from native code inside a signal handler, typically a timer signal handler.
  • Honest Profiler and async-profiler use it.
  • The API is private and not documented.
  • He compares Honest Profiler on the native-method example above: it shows the correct values for the native methods that aren't doing any real work.
  • Skips idle threads
  • No safepoint bias
  • Additional JVM options are required (see the next section)

Important 01

  • When profiling with AsyncGetCallTrace you need to specify -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints, because HotSpot generates debug info only at safepoints by default.
  • Without this flag, inlined methods won't appear in the stack traces at all.

AsyncGetCallTrace fails

  • There are bugs in the JVM that prevent AsyncGetCallTrace from working every time.
  • async-profiler has introduced workarounds ("tricks", as the author calls them) to traverse the stack correctly in these cases.
  • See https://bugs.openjdk.java.net/browse/JDK-8178287

PMU

  • Starting with the Pentium, all Intel chips have had hardware performance monitoring.
  • Modern CPUs have hundreds of different counters.
  • The PMU can be configured to generate a hardware interrupt when a given counter overflows; we can use this for profiling.
  • Example counters: cycles, instructions, cache misses, branch misses

Linux PMU

  • perf_event_open
  • A Linux syscall for subscribing to hardware and OS events
  • perf_event_open also allows some of the profiling to be done in hardware

perf

  • Example:
perf record -F 1009 java
perf report
  • This takes a sample 1009 times per second. That number is not a typo: a prime is used rather than 1000 to avoid sampling in lockstep with other periodic events in the system.
  • We don't see anything about Java, because perf doesn't understand Java, the JIT, etc.
  • We need to give perf the mapping from JIT-compiled addresses to method names.
  • perf-map-agent can collect this information and produce a mapping file for perf.
  • It requires the -XX:+PreserveFramePointer flag to be set.
  • perf walks stacks using frame pointers:
    • Each stack frame represents a function or method.
    • The base slot of each frame points to the previous frame, so perf can walk the chain and collect the stack trace.
    • How do you find the first frame? The frame pointer register points to it, but by default HotSpot uses the frame pointer as a regular general-purpose register.
  • -XX:+PreserveFramePointer reverses this optimisation, at a cost of <5% overhead.

Flamegraph

  • y-axis = stack depth: the higher the frame, the deeper the stack.
  • Width of a rectangle on the x-axis = total time spent in that method.
  • Strictly speaking this is sampling, so it is not really time but the number of samples in which the method was seen.
  • You can search; the bottom right then shows the percentage of samples that match.
  • Green = Java method
  • Yellow = JVM C++ code
  • Red = native code
  • Brown = Linux kernel code

Problems with perf

  • perf sees the interpreter as a single piece of machine code; it doesn't know which interpreted method is executing, so in his example it is difficult to tell which method is using the CPU.
  • Requires a special agent
  • Requires a JVM flag to be set
  • Needs some settings at the Linux kernel level
  • perf is limited to 127 stack frames by default
  • perf profiles are very large:
    • 1000 threads sampled for 60 s can produce a gigabyte-sized output file

Async Profiler part 2

Video: https://www.youtube.com/watch?v=WnRngFMBe7w&list=PLNCLTEx3B8h4Yo_WvKWdLvI9mj1XpTKBr&index=2

Mixed approach

  • It is good to mix the approaches: perf_event_open for kernel and native code (and for hardware counters), and the AsyncGetCallTrace API for Java code.
  • This is the approach used in async-profiler.
  • The tricky part is how to merge the stacks from the two sources.

How to use flamegraphs

  • Look for leaf flames
  • Look for long leaf flames

Using Async profiler

  • You can attach later, but you should start the JVM with -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints if you intend to do that.

Demo3

FileReader
  • He tried a number of buffer sizes for reading a file, looking for the best I/O performance.
  • He originally picked a 32 MB buffer, the thinking being that a bigger buffer means fewer system calls.
  • In the case he tested, 16 MB was more performant than 32 MB.
  • The page faults in the kernel were caused by malloc using a different implementation depending on the allocation size: the 32 MB allocation involved paging, while the smaller buffer was fine.
nanoTime
  • Strange behaviour of nanoTime.
  • When his colleague upgraded his PC, all his benchmarks became much slower.
  • The reason was that nanoTime was much slower on the new machine.
  • A Java-only profiler shows no difference.
  • With async-profiler you can see the system calls, shown in brown.
  • This is related to Linux having multiple available clock sources.
  • I found an issue when testing with J8/J11, with and without the DebugNonSafepoints setting.
  • Example local execution on Linux 5.4.0-58-generic:

Flamegraph broken

  • Here we see [unknown_Java], which is evidently the call to gettime.

Time calls are slow on EC2 because the clock source is xen by default; see: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/ (a benchmark sketch follows below).
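
A minimal sketch (mine, not from the talk) to reproduce the observation: time a tight loop of System.nanoTime() calls. With a vDSO-backed clock source such as tsc the call stays in user space; with xen or hpet every call becomes a real system call and is far slower.

public class NanoTimeCost {
    public static void main(String[] args) {
        final int N = 10_000_000;
        long sink = 0;
        // Warm up so the loop is JIT-compiled before measuring.
        for (int i = 0; i < N; i++) sink += System.nanoTime();

        long start = System.nanoTime();
        for (int i = 0; i < N; i++) sink += System.nanoTime();
        long elapsed = System.nanoTime() - start;

        // Average cost of a single nanoTime() call.
        System.out.printf("avg nanoTime(): %.1f ns (sink=%d)%n",
                (double) elapsed / N, sink);
    }
}

Run it once per clock source (the current one is in /sys/devices/system/clocksource/clocksource0/current_clocksource) to see the difference.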

Wall clock profiling

  • Can start as an agent
  • CPU profiling won't always show why your startup is slow.
  • The time is not always spent on CPU-intensive work; it may be disk I/O, network I/O (DB, etc.), logging, etc.
  • Wall clock profiling handles:
    • Thread.sleep
    • Object.wait / Condition.await
    • Waiting to acquire a lock/semaphore
    • Waiting for I/O

Lock contention

  • Special mode for lock contention
./profiler.sh -e lock -o flamegraph=total
  • Counts the amount of time spent acquiring locks (see the toy workload below)
  • This applies to ReentrantLock, etc., as well
  • flamegraph=total then shows the total time in nanoseconds
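
A toy workload to point -e lock at (an illustrative sketch, not from the talk): two threads fight over a single ReentrantLock, so most of their time goes into acquiring it.

import java.util.concurrent.locks.ReentrantLock;

public class LockContention {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static long counter;

    public static void main(String[] args) {
        Runnable task = () -> {
            while (true) {
                LOCK.lock();
                try {
                    // Hold the lock long enough to create real contention.
                    for (int i = 0; i < 10_000; i++) counter++;
                } finally {
                    LOCK.unlock();
                }
            }
        };
        new Thread(task, "worker-1").start();
        new Thread(task, "worker-2").start();
    }
}

Running the profiler.sh command above against this process should attribute nearly all lock time to the two worker threads.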

PAUSED at 39:00 on the second video.

Async Profiler part 3

https://www.youtube.com/watch?v=bTDmpwhwy3E&list=PLNCLTEx3B8h4Yo_WvKWdLvI9mj1XpTKBr&index=3

How many objects created?

  • It is not always obvious how many objects are being created, even in simple code.
  • He gives an example with timezones where allocation was out of control.
  • The bug was that the timezone was unknown, so its rules were being re-created on every call.

VisualVM Sampler

  • Sampler = a snapshot of the histogram of objects on the heap at that point in time.

Profiler

  • Profiler = bytecode instrumentation added to every place where new objects are allocated. More accurate, but slower.
  • Tools optimise this by recording only every n-th allocation.

Dtrace/SystemTap

  • Another method that worked for HotSpot is the allocation probe.
  • This worked in JDK 8 but may not work in newer versions.
  • All allocations become slower; the overhead is large.

Other tools

Overhead

  • These tools all have a large overhead, apart from JMC.

JFR

  • JFR has very low overhead, <5%, for capturing allocations.
  • Now open source, and backported to 8u262.
  • Gives you the stack trace of where the objects are allocated.

TLAB (Thread Local Allocation Buffer)

  • If there is enough space in the TLAB, use the simple TLAB allocation
  • If there is not enough space in the TLAB, 2 options:
    • Allocate directly in Eden
    • Or allocate a new TLAB and discard the existing one

Simple TLAB

  • Fast: inlined
  • Allocation is handled here first if possible
  • Allocation requires no synchronisation
  • Allocation is a simple pointer increment

Slower

  • Call to VM Runtime
  • Outside TLAB
  • Allocation of new TLAB
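
A hedged simulation of the fast path in plain Java (the real logic lives in HotSpot's native code; all names here are made up):

public class TlabSketch {
    private long tlabTop;       // next free offset in this thread's TLAB
    private final long tlabEnd; // end of the TLAB

    TlabSketch(long size) { this.tlabTop = 0; this.tlabEnd = size; }

    // Fast path: bump the pointer; no synchronisation is needed because
    // the TLAB belongs to a single thread.
    long allocate(long size) {
        if (tlabTop + size <= tlabEnd) {
            long addr = tlabTop;
            tlabTop += size;
            return addr;
        }
        return slowPath(size);
    }

    // Slow path: the real JVM either allocates directly in Eden, or
    // retires this TLAB and requests a fresh one from the VM runtime.
    long slowPath(long size) {
        throw new UnsupportedOperationException("slow path not modelled");
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(1024);
        System.out.println("first object at offset " + tlab.allocate(16));
        System.out.println("second object at offset " + tlab.allocate(32));
    }
}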

JFR, Async Profiler

  • They are both concerned only with slow-path allocation, i.e. outside the TLAB.
  • This is the tack taken by both JFR and async-profiler.
  • Intercept the slow-path allocation, and record the stack trace and the object allocated.
  • Works in OpenJDK 7u40+.
  • No commercial features required.

JDK 11

  • JEP 331: low-overhead heap profiling
  • JVMTI SampledObjectAlloc event and SetHeapSamplingInterval function
  • An additional boundary is added inside the TLAB
  • The virtual boundary can be set to a specific size so that allocations crossing it are sampled? Not entirely sure about its operation.

demo7

  • The standard -e alloc mode counts the number of allocations, but we can also profile by the size of the allocations.
  • -o flamegraph=total does this for alloc mode (a toy workload follows below).
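
A toy allocation workload to try -e alloc against (illustrative, not the demo code):

import java.util.ArrayList;
import java.util.List;

public class AllocWorkload {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (long i = 0; ; i++) {
            list.add((int) (i % 1000));   // Integer boxing allocates
            if (list.size() > 1_000_000) {
                list = new ArrayList<>(); // drop the old list, allocate a new one
            }
        }
    }
}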

Alloc profiling colours

  • Blue = allocation in a new TLAB
  • Brown = allocation outside a TLAB
  • Outside a TLAB usually means the object is too large for a TLAB
  • Typical TLAB size: ~100 KB up to a few MB

G1

  • G1GC has humongous allocations: objects that span one or more G1 regions.
  • They usually hurt performance badly.
  • async-profiler can profile any native function.
  • He uses this to catch G1 humongous allocations via: -e G1CollectedHeap::humongous_obj_allocate
  • Then start the app with -XX:+UseG1GC (a toy workload follows below).
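
An illustrative workload (not the demo code) that triggers the humongous path: an object bigger than half a G1 region is allocated as humongous.

public class HumongousAlloc {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // An 8 MB array exceeds half of a typical default-sized G1 region,
            // so it goes through G1CollectedHeap::humongous_obj_allocate.
            byte[] big = new byte[8 * 1024 * 1024];
            big[0] = 1; // touch it so the allocation isn't eliminated
            Thread.sleep(10);
        }
    }
}

Run with -XX:+UseG1GC and profile with the event above.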

Setting TLAB size

  • You can set the TLAB size with -XX:TLABSize=10k: a 10 KB TLAB, smaller than the default.
  • A special hack to measure every single allocation in the app: -XX:-UseTLAB turns TLABs off entirely.
  • This trick only works with G1GC; with other GCs the allocations won't show up, because they don't go through the VM runtime.

Capturing native allocs

  • Native allocation events are disabled by default: most allocation happens in Java code, and they would clutter the flamegraphs.
  • You can capture this information using the cstack option, e.g. cstack=fp.
  • Sometimes you cannot get native traces, because the native library was compiled without maintaining the frame-pointer calling convention.
  • gcc by default doesn't preserve frame pointers, so it skips maintaining the linked list of frames.
  • You will then see truncated native stacks.
  • Modern hardware, from Intel Nehalem onwards, supports stack walking in hardware via the Last Branch Record (LBR) mechanism.
  • The processor keeps track of jumps in the code (function calls and conditional branches), which makes it possible to match calls with their associated return instructions.
  • From around Linux 4.2(?), there is an option to make the perf events API use LBR to collect hardware stacks.
  • Using the LBR mechanism with OpenSSL, we get proper function names from the library in the flamegraph.

demo 8

  • async-profiler has a Java API (see the sketch below).
  • We can profile the app using the Java API rather than the JVMTI agent or the attach script.
  • This is used with async-profiler.jar.
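
A minimal sketch of the Java API, assuming async-profiler.jar (one.profiler.AsyncProfiler) is on the classpath and the native library can be found; check your async-profiler version for the exact signatures.

import one.profiler.AsyncProfiler;

public class ApiDemo {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        // Commands use the same syntax as the agent arguments.
        profiler.execute("start,event=cpu,interval=10ms");

        doWork();

        // Stop and dump the result in one call.
        profiler.execute("stop,file=profile.html");
    }

    private static void doWork() {
        long sink = 0;
        for (int i = 0; i < 100_000_000; i++) sink += (long) i * i;
        System.out.println(sink);
    }
}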

Profiling in Idea Ultimate

  • Similar, but it has a nice feature: it can show the profile per thread or with all threads grouped together.
  • The colours are also different in IDEA.
  • It has a nice tree view, etc.
  • You can also attach the IDEA profiler to an already-running app.
  • You can open snapshots produced by external tools, in JFR or collapsed-stack format, in IDEA.

demo10 Non-trivial native mem leaks

  • A simple servlet that provides an image by id.
  • It gets the image from the resource stream.
  • It sends it as a byte array.
  • Requires dependencies.
  • He uses the malloc event to see where the data is being allocated.
  • Profiling malloc is not always best for finding the leak, as the memory may be released again after being allocated.
  • Instead he uses mprotect profiling: malloc takes big chunks from mmap and calls mprotect to commit more native memory.
  • This is specifically related to the growth of the RSS (Resident Set Size).
  • The resource is compressed, so an Inflater is created.
  • The Inflater has a finalize method, so the problem would eventually be fixed by GC kicking in, but there are no guarantees (see the reconstruction below).
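
A hedged reconstruction of the leak pattern described above (not the actual demo source; the resource path is hypothetical): a jar resource stream wraps an Inflater that holds native memory, and forgetting close() leaks it until finalization happens to run.

import java.io.IOException;
import java.io.InputStream;

public class LeakyRead {
    static byte[] readImage(String id) throws IOException {
        InputStream in = LeakyRead.class.getResourceAsStream("/images/" + id + ".png");
        return in.readAllBytes(); // BUG: the stream (and its Inflater) is never closed
    }

    static byte[] readImageFixed(String id) throws IOException {
        // try-with-resources releases the native Inflater memory promptly.
        try (InputStream in = LeakyRead.class.getResourceAsStream("/images/" + id + ".png")) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes /images/logo.png exists on the classpath.
        System.out.println(readImageFixed("logo").length + " bytes");
    }
}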

FileTest

  • There are tracepoints in the Linux kernel that can be hooked.
  • It's the logger causing the issue: the log location wasn't right, and the log wasn't rotated correctly.
  • He identified that the page cache, which is managed by the OS, was growing.
  • So he looked through the tracepoints in perf list, found one related to the page cache, and used it to debug the issue.

Problems with Flamegraphs

  • A flamegraph is aggregated over the whole recording.
  • async-profiler can instead write out individual events (all samples) if required.

Demo12

  • Worried about CPU spikes that happen every 5 s.
  • Because they happen infrequently, an aggregated flamegraph cannot show why these spikes happen.
  • async-profiler can record to JFR format instead.

Flamescope

  • Shows events in a heatmap style.
  • Flamescope is not installed.
  • Convert from JFR to FlameScope format using the converter built into the async-profiler project:

java -cp build/converter.jar jfr2nflx <input.jfr> <out.nflx>

demo9

How to do HW performance counters

  • The example starts two threads, both incrementing random elements of an array.
  • The array has one million elements.
  • The 1st thread touches only a small part of the array.
  • The 2nd thread touches all of the array.
  • Normal CPU profiling won't show much difference.
  • False sharing will show up when profiling with llc-load-misses (a reconstruction follows below).
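
An illustrative reconstruction of the demo (not the original source): one thread stays inside a small, cache-friendly slice of the array, the other roams over all one million elements and misses the cache constantly. Profile with -e llc-load-misses to see the difference.

import java.util.concurrent.ThreadLocalRandom;

public class CacheMissDemo {
    static final int SIZE = 1_000_000;
    static final long[] array = new long[SIZE];

    public static void main(String[] args) {
        new Thread(() -> bump(1024), "small-range").start(); // fits in cache
        new Thread(() -> bump(SIZE), "full-range").start();  // thrashes the cache
    }

    static void bump(int range) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        while (true) {
            array[rnd.nextInt(range)]++; // increment a random element
        }
    }
}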

async-profiler events

  • Profile VMThread::execute to see stop-the-world operations.
  • Profile Deoptimization::uncommon_trap to see deoptimizations.
  • Profile java_lang_Throwable::fill_in_stack_trace to see where exceptions are created.
  • You can profile any Java method, or even every method of a class.

demo 11

  • The scenario: exceptions are being thrown in the JVM, but the source is unknown.
  • We can use the fully qualified method name as the event:

./profiler.sh -d 5 -f out.svg -e java.lang.NullPointerException.<init> jps
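
A toy emitter (illustrative, not from the talk) that such a profile would catch: the exceptions are swallowed, so nothing appears in the logs, and only profiling the constructor reveals where they come from.

public class NpeStorm {
    public static void main(String[] args) {
        String s = null;
        while (true) {
            try {
                s.length(); // throws NullPointerException
            } catch (NullPointerException e) {
                // Swallowed: invisible in logs, visible to the profiler.
            }
        }
    }
}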

Things to look further

  • Flamescope

Tasks

  1. Raise an issue about nanoTime; even when connecting directly as an agent it didn't work as expected.
  2. demo6 doesn't work as expected due to the classpath failure.
  3. The second part of the presentation doesn't seem to be on the site.
  4. Find out about the mprotect event?

Questions

  • Does Async profiler rely on some Linux signals?
  • Async profiler works with what JVMs?
    • It works with: HotSpot, Azul (no IBM :<)
    • YourKit supports Async traces via AsyncGetCallTrace
  • Can you profile specific Java threads to decrease overhead?
  • Do you need sudo permissions to run Profiler?
    • You don't need root to run the image
  • Can Async profiler show threads waiting in native code? (real logic question!)
    • Profiling in wall clock mode will show this, though you need to search specifically for wait in the output, and it can be hard to see.
  • Is there a tool that builds flamegraph from multiple instances?
    • It is possible with JFR, collapsed stack traces
    • These can then be concatenated with simple file concat
    • Can be opened in IDEA
    • Or use async converter to convert collapsed stack traces to flamegraph
  • A common problem: small functions which mostly run very fast, around 100 microseconds.
    • Option 1: reduce the profiling interval (-i 100us); high overhead in production.
    • Option 2: use include/exclude to filter the interesting stack traces.
    • Option 3: turn profiling on/off at specific points.
    • Option 4: --begin/--end to start and stop profiling when a particular native function is reached.

Last update: 2022-06-29