sub-second: crash

Friday, September 14, 2012

Dumping Very Large Java Heaps

When a Java application has either a memory leak or much higher than expected memory utilization, it is necessary to obtain heap information to identify the source of the problem. A heap dump is ideal because it can then be analyzed using various tools. However, with very large Java heaps, perhaps larger than 100 GB, a heap dump may be impractical for several reasons:

the heap dump may crash the Java process before completing
the heap dump may hang indefinitely
there may not be enough disk space to accommodate the dump
the dump may be so large that analysis tools are unable to process it

One solution in this scenario is to use the jmap utility to obtain a heap histogram from the running process. In practice this appears to be very lightweight: it completes quickly even on very large heaps and produces a very small summary file that can be used for troubleshooting.

The syntax is as follows, where <pid> is the process ID of the Java process:

jmap -histo <pid>
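If the histogram needs to be captured repeatedly, the command above can be wrapped in a small script. The following is a minimal Python sketch, not a definitive tool: the function names `histo_command` and `save_histogram` are hypothetical, and it assumes `jmap` is on the PATH and is run as the same user as the target process. The `live_only` option maps to jmap's `-histo:live` suboption, which counts only live objects and, as far as I know, forces a full GC first, so it is less lightweight than the plain form.

```python
import subprocess
from pathlib import Path


def histo_command(pid: int, live_only: bool = False) -> list[str]:
    """Build the jmap argument list for a class histogram of process `pid`."""
    # "-histo:live" restricts the count to live (reachable) objects.
    option = "-histo:live" if live_only else "-histo"
    return ["jmap", option, str(pid)]


def save_histogram(pid: int, out_path: Path, live_only: bool = False) -> Path:
    """Run jmap and write its standard output to `out_path`.

    Hypothetical helper: raises CalledProcessError if jmap exits non-zero.
    """
    result = subprocess.run(
        histo_command(pid, live_only),
        capture_output=True,
        text=True,
        check=True,
    )
    out_path.write_text(result.stdout)
    return out_path
```

A call such as `save_histogram(12345, Path("histo-1.txt"))` would then leave a small text file per snapshot, where 12345 stands in for the real process ID.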

The output is a compact summary showing, for each class in the heap, the number of instances, the total size in bytes, and the class name, for example:

 num     #instances        #bytes  class name
----------------------------------------------
   1:         70052      11118624  <constMethodKlass>
   2:         70052       8422160  <methodKlass>
   3:          6320       8258472  <constantPoolKlass>
   4:          6320       6117216  <instanceKlassKlass>
   5:        116656       5732520  <symbolKlass>
   6:         17467       5729824  [I
   7:          5682       5050352  <constantPoolCacheKlass>
   8:         57275       4818512  [C
   9:         24818       2660384  [B
  10:         59327       1898464  java.lang.String
  11:          2847       1766720  [J
  12:          2978       1542008  <methodDataKlass>
  13:         11687        797256  [S
  14:         13307        706440  [Ljava.lang.Object;
  15:          6777        704808  java.lang.Class
  16:         18904        604928  java.util.HashMap$Entry
  17:         10088        522512  [[I
  18:          5736        499408  [Ljava.util.HashMap$Entry;
  19:         12838        410816  java.util.Hashtable$Entry
  20:          5580        267840  java.util.HashMap
  21:           428        249952  <objArrayKlassKlass>
  22:          5888        235520  java.util.concurrent.ConcurrentHashMap$Segment
  23:          6243        199776  java.util.concurrent.locks.ReentrantLock$NonfairSync
  24:          5888        146544  [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
...
3029:             1            16  sun.awt.X11.XToolkit$4
3030:             1            16  java.util.Collections$EmptyIterator
3031:             1            16  com.sun.tools.visualvm.core.explorer.ExplorerContextMenuFactory
3032:             1            16  sun.reflect.generics.tree.TypeVariableSignature
3033:             1            16  sun.awt.X11.XKeyboardFocusManagerPeer$1
3034:             1            16  org.openide.xml.EntityCatalog$Forwarder
Total        714547      74730032
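Because the histogram is plain text, it is easy to post-process. The sketch below is one possible approach in Python, assuming the row layout shown in the sample above (rank, instance count, byte count, class name). The names `HistoRow`, `parse_histogram`, `top_by_bytes`, and `growth` are hypothetical. Comparing two snapshots taken some time apart is an assumption on my part about how the output might be used to narrow down a leak; it is not something jmap does itself.

```python
import re
from typing import NamedTuple


class HistoRow(NamedTuple):
    instances: int
    size_bytes: int
    class_name: str


# Matches data rows such as "  10:  59327  1898464  java.lang.String".
# The header, separator, "..." and "Total" lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S.*?)\s*$")


def parse_histogram(text: str) -> dict[str, HistoRow]:
    """Parse jmap -histo output into a mapping of class name to row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name = match.group(3)
            rows[name] = HistoRow(int(match.group(1)), int(match.group(2)), name)
    return rows


def top_by_bytes(rows: dict[str, HistoRow], n: int = 5) -> list[HistoRow]:
    """Return the n classes that occupy the most bytes."""
    return sorted(rows.values(), key=lambda r: r.size_bytes, reverse=True)[:n]


def growth(before: dict[str, HistoRow], after: dict[str, HistoRow]) -> list[tuple[int, str]]:
    """Byte growth per class between two snapshots, largest growth first."""
    deltas = []
    for name, row in after.items():
        old = before.get(name)
        deltas.append((row.size_bytes - (old.size_bytes if old else 0), name))
    return sorted(deltas, reverse=True)
```

Classes whose byte count keeps growing from one snapshot to the next are reasonable first candidates to investigate, though steady growth alone does not prove a leak.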

Thursday, September 13, 2012

How to Test the Stability of an Application

Testing the stability of an application is critical. It can prevent system outages by identifying problems before they occur in production. Outages can severely damage a business, in some cases permanently. The following outline provides a reasonable template for testing application stability.

Ramp load up incrementally to the breaking point of the system. Do not stop at expected peak load, because bursts or unexpected traffic can entail load far higher than anticipated.

    Load should cover critical dimensions such as transaction rate/throughput, connections, concurrent users, and the range of use cases/functionality.
    When the application breaks, investigate what broke.
        If the test infrastructure broke (test client capacity hit, test network capacity hit, test case crashed, etc.), the test infrastructure must be repaired so that the application is what breaks, not the test infrastructure.
        If the application broke, diagnose the type of breakage and what broke.
            Is the breakage recoverable?
            Does the breakage affect already connected users, or just block new users?
            Did the application code break (errors, deadlocks, thread blocking, etc.)?
            Was a system resource limit hit (CPU, memory, network, disk)?
                If system resource limits were not hit, does the application need to be fixed so that it is not the bottleneck? The system should scale up until a system limit is hit, whether CPU, memory, disk I/O, or network bandwidth.
            Did a downstream service break?
                How can the downstream service be improved to provide more capacity and stability?
            Did the system just slow down, remaining functional?
            Is a restart required, and what must be restarted (services, server, downstream services, etc.)?
            Can the system be scaled out or scaled up to improve the capacity?
                If not, why not? Is there an architectural limitation preventing further scalability? How can scalability be improved?
    From the test, determine the peak capacity of the application and verify that proper production monitoring is in place to detect this threshold.

Run at near peak capacity for an extended period of time (this could be one day or more, depending on uptime requirements).

    Is the application stable when run for a long time, or does it eventually crash?
        Why does it crash?
    Does performance degrade over time?
        Why does it degrade?

Perform administrative operations that may need to be performed during production usage while the system is near peak load.

    Is the system stable when this happens?

Perform the full suite of functional tests while the system is near peak load.

    Is the system stable when this happens?
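The first step of the outline, ramping load incrementally until something breaks, can be sketched as a simple loop. This is a minimal Python illustration under stated assumptions, not a load-testing tool: `run_step` is a hypothetical callback that drives the system at a given load level (for example, requests per second) and reports what it observed, and the "broken" thresholds for error rate and 95th percentile latency are placeholder values that would have to come from the application's own requirements.

```python
from typing import Callable, NamedTuple, Optional


class StepResult(NamedTuple):
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float   # 95th percentile latency in seconds


def ramp_to_break(
    run_step: Callable[[int], StepResult],
    start: int,
    increment: int,
    max_load: int,
    max_error_rate: float = 0.01,
    max_p95_latency_s: float = 1.0,
) -> tuple[Optional[int], Optional[int]]:
    """Increase load step by step until a threshold is exceeded.

    Returns (last_healthy_load, breaking_load). breaking_load is None
    if no break was observed up to max_load.
    """
    last_ok = None
    load = start
    while load <= max_load:
        result = run_step(load)
        broken = (
            result.error_rate > max_error_rate
            or result.p95_latency_s > max_p95_latency_s
        )
        if broken:
            return last_ok, load
        last_ok = load
        load += increment
    return last_ok, None
```

If the second value comes back as `None`, the ramp ended before the application broke, which suggests raising `max_load` or checking whether the test infrastructure itself has become the limit. The last healthy load is a first estimate of peak capacity for the extended run.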

Document the results of the test carefully. Do not ignore crashes and instability. Spend the time and effort to understand the behavior and harden the application to behave well under any conditions, anticipated or not.
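For the extended run near peak capacity, the question "Does performance degrade over time?" can be answered from recorded latency samples rather than by eye. The sketch below fits a least-squares line to (timestamp, latency) pairs and flags the run if the fitted latency rises by more than a chosen fraction of the mean over the sampled period. The function names and the 10% default tolerance are assumptions for illustration; a linear fit is only one possible heuristic and will miss degradation that is not roughly linear.

```python
def latency_slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of latency over time (latency units per second).

    `samples` holds (timestamp_seconds, latency) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def degrades_over_time(samples: list[tuple[float, float]], tolerance: float = 0.10) -> bool:
    """True if the fitted rise over the sampled span exceeds tolerance * mean latency."""
    times = [t for t, _ in samples]
    span = max(times) - min(times)
    mean_latency = sum(y for _, y in samples) / len(samples)
    return latency_slope(samples) * span > tolerance * mean_latency
```

The same approach could be applied to other series collected during the run, such as heap usage after each full GC, with the tolerance adjusted to suit.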