JVM Tenured/Old gen reached limit & server hanging


Solution 1

For your specific questions:

  1. The default ratio between new and old generations can depend on the system and what the JVM determines will be best.
  2. You can set a specific ratio between the new and old generations with -XX:NewRatio, e.g. -XX:NewRatio=3 makes the old generation three times the size of the new generation.
  3. If your JVM is hanging and the heap is full, it's probably stuck doing constant GCs.
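
To make the ratio arithmetic concrete, here is a minimal sketch that works out the nominal old generation size for a 12GB heap at a few NewRatio values and then prints what the running JVM actually reports for each memory pool. The class and method names are made up for illustration. The nominal figure is only a starting point: the JVM may resize generations at run time, so the reported pool maximum can differ from it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class GenerationSizes {
    /** Nominal old gen size when old:new = newRatio:1 (hypothetical helper). */
    public static long nominalOldGenBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1) * newRatio;
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        final long heap = 12L * 1024L * mb; // 12GB, the -Xmx from the question

        // NewRatio=2 gives old gen = 2/3 of the heap, NewRatio=3 gives 3/4.
        for (int ratio = 2; ratio <= 4; ratio++) {
            System.out.println("NewRatio=" + ratio + " -> nominal old gen ~"
                    + nominalOldGenBytes(heap, ratio) / mb + " MB");
        }

        // What this JVM actually reports (pool names depend on the collector).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / mb) + " MB";
            System.out.println(pool.getName() + ": used=" + usage.getUsed() / mb
                    + " MB, max=" + max);
        }
    }
}
```

Running this with the same options on QA and prod is one way to compare what each box really ends up with.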

It sounds like you need more memory for prod. If on QA the request finishes then perhaps that extra 0.67GB is all that it needs. That doesn't seem to leave you much headroom though. Are you running the same test on QA as will happen on prod?

Since you're using 12GB you must be on a 64-bit JVM. You can save much of the memory overhead of 64-bit addressing by using the -XX:+UseCompressedOops option. It typically saves around 40% memory, though the exact figure depends on how reference-heavy your data is, so your 12GB will go a lot further.

Depending on what you're doing the concurrent collector might be better as well, particularly to reduce long GC pause times. I'd recommend trying these options as I've found them to work well:

-Xmx12g -XX:NewRatio=4 -XX:SurvivorRatio=8 -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled
-XX:+CMSScavengeBeforeRemark -XX:CMSInitiatingOccupancyFraction=68
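
After changing options like these it's worth confirming that the JVM actually picked them up, since a startup script can silently override them. A rough sketch, assuming you can run a small class (or equivalent code) inside the same JVM; the class and helper names are hypothetical. With the CMS options above, HotSpot normally reports collectors named "ParNew" and "ConcurrentMarkSweep".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcConfigCheck {
    /** Keeps only the -X... arguments, which covers -Xmx and all -XX: options. */
    public static List<String> heapAndGcArgs(List<String> jvmArgs) {
        List<String> result = new ArrayList<String>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-X")) {
                result.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with (does not include the main class args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String arg : heapAndGcArgs(jvmArgs)) {
            System.out.println("flag: " + arg);
        }

        // Collectors that are actually in use for this JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("collector: " + gc.getName());
        }
    }
}
```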

Solution 2

You need to get some more data in order to know what is going on; only then will you know what needs to be fixed. To my mind that means:

  1. Get detailed information about what the garbage collector is doing. These params are a good start (substitute your preferred path and file name in place of gc.log):

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log -verbose:gc

  2. Repeat the run, scan through the GC log for the period when it is hanging, and post back with that output.

  3. Consider watching the output using visualgc, which is part of jvmstat (it requires jstatd running on the server; one random link that explains how to do this setup is this one). This is a very easy way to see how the various generations in the heap are sized (though perhaps not for 6hrs!).
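
If you can't easily get at the logs or run visualgc, the same "is it stuck in GC?" question can be roughly answered from inside the JVM using the standard management beans. This is only a sketch under assumptions: the class name, the 1 second interval and the sample count are made up, and the collection time the beans report is an approximate accumulated elapsed time, so treat the percentage as an indicator, not a measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadSampler {
    /** Accumulated GC time in ms across all collectors (undefined values skipped). */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Share of a wall-clock interval spent collecting, as a percentage. */
    public static double gcPercent(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillisDelta / wallMillisDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 1000; // hypothetical sampling interval
        final int samples = 3;        // hypothetical sample count
        long lastGc = totalGcTimeMillis();
        long lastWall = System.currentTimeMillis();
        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMs);
            long gcNow = totalGcTimeMillis();
            long wallNow = System.currentTimeMillis();
            System.out.println("GC share of last interval: "
                    + gcPercent(gcNow - lastGc, wallNow - lastWall) + "%");
            lastGc = gcNow;
            lastWall = wallNow;
        }
    }
}
```

A share that sits near 100% while the old gen stays full is consistent with the JVM doing back-to-back full GCs. The sampler has to run in the application's own JVM (for example from a background thread) to be meaningful.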

I also strongly recommend you do some reading so you know what all these switches refer to; otherwise you'll be blindly trying stuff with no real understanding of why one thing helps and another doesn't. I'd start with the Oracle Java 6 GC tuning page.

I'd only suggest changing options once you have baselined performance. Having said that, CompressedOops is very likely to be an easy win; note that it has been on by default since 6u23.

Finally, you should consider upgrading the JVM; 6u18 is getting on a bit and performance keeps improving.

"each job will take 3 hours to complete and almost 6 jobs running one after another. Last job when running reaches 8GB max and getting hang in prod"

Are these jobs related at all? If they're not working on the same data set, this really sounds like a gradual memory leak: if heap usage keeps going up and up and eventually blows, you have a leak. Consider using -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/some/dir to catch a heap dump if/when it blows (note that with a 12GB heap it will be a big file, so make sure you have the disk space). You can then use jhat to look at what was on the heap at the time.
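
One way to check the "going up and up" theory before anything blows is to look at how much of the old gen is still occupied right after a collection, since that is what a leak pushes upwards from job to job. A minimal sketch, assuming HotSpot's usual pool names ("PS Old Gen", "CMS Old Gen", "Tenured Gen"); the class and helper names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenAfterGc {
    /** Assumes HotSpot-style pool names; other JVMs may name pools differently. */
    public static boolean isOldGenPool(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !isOldGenPool(pool.getName())) {
                continue;
            }
            // Usage measured after the most recent collection of this pool;
            // null if the JVM does not support it for this pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc == null) {
                System.out.println(pool.getName() + ": collection usage not supported");
            } else {
                System.out.println(pool.getName() + ": used after last GC = "
                        + afterGc.getUsed() / mb + " MB");
            }
        }
    }
}
```

If you log this figure at the end of each job and it climbs steadily across the 6 jobs, that points to a leak. If it drops back to a similar level each time, the heap is more likely just undersized for the last job.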

Author: raksja

Passionate Software Engineer, who love to code.

Updated on May 11, 2020

Comments

  • raksja
    raksja almost 4 years

    Our application requires a very large amount of memory since it deals with very large data. Hence we increased our max heap size to 12GB (-Xmx).

    Following are the environment details

    OS - Linux 2.6.18-164.11.1.el5    
    JBoss - 5.0.0.GA
    VM Version - 16.0-b13 Sun JVM
    JDK - 1.6.0_18
    

    We have the above environment and configuration in both QA and prod. In QA the max PS Old Gen (heap memory) is allocated as 8.67GB, whereas in prod it is just 8GB.

    In prod, for a particular job the Old Gen heap reaches 8GB and hangs there, the web URL becomes inaccessible, and the server goes down. In QA it also reaches 8.67GB, but a full GC is performed and usage comes back down to 6.5GB or so. It does not hang there.

    We couldn't figure out a solution for this because the environment and configuration on both boxes are the same.

    I have 3 questions here:

      1. 2/3 of the max heap will be allocated to the old/tenured gen. If that is the case, why is it 8GB in one place and 8.67GB in the other?

      2. How do I provide a valid ratio for New and Tenured in this case (12GB)?

      3. Why is a full GC performed in one place and not in the other?

    Any help would be really appreciated. Thanks.

    Please let me know if you need further details on the environment or configuration.