Can the JVM recover from an OutOfMemoryError without a restart?


Solution 1

It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:

  • There may really not be enough memory to do the requested tasks, even after taking recovery steps such as releasing a block of reserved memory. In this situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.

  • The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.

  • If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.

  • Suppose that a thread synchronizes with other threads using notify/wait or some higher-level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifications (or similar signals) that never come. Designing for this could make the application significantly more complicated.

In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
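One way to act on "treat OOME as a fatal error" is a last-resort handler that stops the process instead of letting it limp on. What follows is only a minimal sketch under that assumption: the class `FatalErrorPolicy` and its methods are hypothetical names, not part of any framework. It also only sees errors that reach the top of a thread's stack, so code that catches `Throwable` lower down can still hide an OOME from it.

```java
final class FatalErrorPolicy {
    private FatalErrorPolicy() {}

    /** Returns true if t, or anything in its cause chain, is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Installs a default handler that halts the JVM when a thread dies from an OOME. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOutOfMemory(error)) {
                // halt() skips shutdown hooks, which might themselves need memory.
                Runtime.getRuntime().halt(1);
            } else {
                // Other failures are only reported; the thread has still died.
                System.err.println("Uncaught in " + thread.getName() + ": " + error);
            }
        });
    }
}
```

Calling `FatalErrorPolicy.install()` once at startup would be enough. Recent HotSpot JVMs also accept `-XX:+ExitOnOutOfMemoryError`, which gets a similar effect without any code, and an external supervisor can then restart the process.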


EDIT - in response to this followup question:

In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it?

No, you don't have to restart. But it is probably wise to, especially if you don't have a good, automated way of checking that the service is running correctly.

The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)

EDIT 2

In response to this comment:

"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?

Yes really! Consider this:

Thread #1 runs this:

    synchronized (lock) {
        while (!someCondition) {
            lock.wait();  // blocks until another thread calls notify() on lock
        }
    }
    // ...

Thread #2 runs this:

    synchronized (lock) {
        // do something
        lock.notify();
    }

If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do something section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!

"If not the code ran by the thread is not exception safe, which is a more general problem."

"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.

You'd need some mechanism whereby the failure of Thread #2 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #1. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!

(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)
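For the wait/notify example above, a hand-rolled approximation of such a failure notification is to have the notifying thread signal from a finally block, so the waiter wakes up whether the work succeeded or threw. This is a sketch, not a fix: `Handoff` and its method names are made up for illustration. It only covers this one handoff, and under a real OOME other parts of the program may already be in an inconsistent state.

```java
final class Handoff {
    private final Object lock = new Object();
    private boolean done;    // guarded by lock
    private boolean failed;  // guarded by lock

    /** Runs the work, then wakes waiters whether it succeeded or threw. */
    void runAndSignal(Runnable work) {
        boolean ok = false;
        try {
            work.run();
            ok = true;
        } finally {
            // Only flips booleans, so the failure path itself allocates nothing.
            synchronized (lock) {
                done = ok;
                failed = !ok;
                lock.notifyAll();
            }
        }
    }

    /** Blocks until the work finishes; returns false if the worker failed. */
    boolean awaitSuccess() {
        synchronized (lock) {
            while (!done && !failed) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return done;
        }
    }
}
```

The error still propagates out of `runAndSignal`, so the worker thread dies as before. The difference is that the waiting thread gets `false` back instead of blocking forever, and then has to decide what to do about it.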

Solution 2

I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, and with more memory if possible (or a more efficient app). However, I've seen a fair number of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.

Solution 3

The JVM will run the GC when it's on the edge of an OutOfMemoryError. Only if the GC fails to free enough memory will the JVM throw the OOME.

You can, however, catch it and, if necessary, take an alternative path. Any objects allocated inside the try block become eligible for garbage collection once they are no longer reachable.

Since the OOME is "just" an Error which you could just catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
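As an illustration of catching the error and taking an alternative path, here is a minimal sketch that asks for a large buffer and settles for a smaller one. `BufferAllocator` and its parameters are hypothetical. It shows the benign case of one oversized request on one thread, not a general recovery strategy.

```java
final class BufferAllocator {
    private BufferAllocator() {}

    /**
     * Tries to allocate a buffer of preferredLength longs and, if the JVM
     * reports OutOfMemoryError, falls back to fallbackLength instead.
     */
    static long[] allocate(int preferredLength, int fallbackLength) {
        try {
            return new long[preferredLength];
        } catch (OutOfMemoryError e) {
            // The failed array was never created, so nothing from the try
            // block is left holding memory; a smaller request may succeed.
            return new long[fallbackLength];
        }
    }
}
```

For example, `BufferAllocator.allocate(Integer.MAX_VALUE, 1024)` would be expected to end up on the fallback path on a typical JVM, since an array that large is refused outright.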


Solution 4

Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...

It depends on a lot of things. For example if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may have objects cleaned up in a later GC round that leave open chunks big enough for your purposes. In that situation you may be able to restart. It's probably worth at least retrying once as a result. But...
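The "retry once" idea could look something like the sketch below. `RetryOnOom` is a hypothetical helper, and it assumes the task is safe to run twice (no half-finished side effects from the first attempt).

```java
import java.util.function.Supplier;

final class RetryOnOom {
    private RetryOnOom() {}

    /** Runs the task; if it dies with an OutOfMemoryError, tries exactly once more. */
    static <T> T callWithOneRetry(Supplier<T> task) {
        try {
            return task.get();
        } catch (OutOfMemoryError first) {
            // By now the failed attempt's stack has unwound, so its temporary
            // objects are unreachable and a later collection can reclaim them.
            return task.get(); // a second OOME propagates to the caller
        }
    }
}
```

Catching at the start of the task, rather than deep inside it, is what lets the first attempt's allocations go out of scope before the second attempt begins.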

You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.
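When working out what is going on, it helps to know how close the heap is to its ceiling. The sketch below reads that from `java.lang.Runtime`. The class name `HeapReport` is made up, and the numbers are a coarse snapshot rather than a substitute for GC logs or a heap dump.

```java
final class HeapReport {
    private HeapReport() {}

    /** Bytes currently in use on the heap (committed minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** The heap ceiling the JVM will try to use, for example as set by -Xmx. */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One-line summary suitable for a periodic log message. */
    static String summary() {
        long mib = 1024L * 1024L;
        return "heap used " + (usedBytes() / mib) + " MiB of " + (maxBytes() / mib) + " MiB max";
    }
}
```

If the "max" figure is lower than expected, the memory ceiling is the first thing to check; if "used" climbs steadily across requests, a leak is the more likely culprit.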

Solution 5

You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to doing your recovery work; when you catch the OOME, null out that pre-allocated reference and you're more likely to have some memory to use in your recovery sequence.
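A minimal sketch of that reserve, assuming a hypothetical `MemoryReserve` class (the name, the `runGuarded` helper and the reserve size are all illustrative):

```java
final class MemoryReserve {
    // Volatile so the release is visible to whichever thread handles the error.
    private volatile byte[] reserve;

    MemoryReserve(int bytes) {
        this.reserve = new byte[bytes];
    }

    /** Drops the reserve so the collector can reclaim it; returns false if already released. */
    boolean release() {
        boolean held = reserve != null;
        reserve = null;
        return held;
    }

    boolean isHeld() {
        return reserve != null;
    }

    /** Runs the task; on OOME, frees the reserve first and then runs the recovery action. */
    void runGuarded(Runnable task, Runnable recovery) {
        try {
            task.run();
        } catch (OutOfMemoryError e) {
            release();
            recovery.run();
        }
    }
}
```

The reserve can only be spent once, so after a recovery the application would either need to re-create it or, more conservatively, shut down cleanly.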

I don't know about different JVM implementations.

Author: sengs

Updated on April 09, 2021

Comments

  • sengs
    sengs about 3 years
    1. Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?

    2. Do the various JVM implementations differ in this aspect?

    My question is about the JVM recovering and not the user program trying to recover by catching the error. In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it? Or can I let it run if further requests seem to work without a problem.

  • Stephen C
    Stephen C about 14 years
    "Generally frameworks that run other code, like application servers, attempting to continue in the face of an OME makes sense". I disagree, unless the framework is extremely robust, attempting to recover from an OOME can result in (for example) a catatonic server. Been there, seen that!
  • Yishai
    Yishai about 14 years
    @Stephen C, I shudder to think what calling System.exit(1) on any OME in JBoss would look like. Every time a user tried to read too much data, everyone goes down. I agree that it can lead to problems, but the most likely cause of an OME for an app server is user code doing too much, and as long as they catch it at a point where the user code allocations are no longer reachable, full-recovery is the most likely outcome and worth coding for, IMO.
  • Stephen C
    Stephen C about 14 years
    @Yishai - a bad request (e.g. user tried to read too much data) should not be allowed to cause an OOME in the first place. The correct fix is to make the request processing more defensive ... not to try to recover from OOMEs.
  • Yishai
    Yishai about 14 years
    @Stephen C, the author of an application server doesn't have that option.
  • Stephen C
    Stephen C about 14 years
    @Yishai - yes he/she does. Just provide a way for ding-bat application developers / deployers to enable dodgy OOME recovery.
  • Yishai
    Yishai about 14 years
    @Stephen C, in other words recover from it ;).
  • Stephen C
    Stephen C about 14 years
    @Yishai - well try to recover from it. As I said in my example, it is difficult to know if an OOME recovery has really worked.
  • Raedwald
    Raedwald over 12 years
    "other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks? If not the code ran by the thread is not exception safe, which is a more general problem.
  • Raedwald
    Raedwald over 12 years
    "The OOME may be thrown on any thread" and at any time, not just at a new. And that includes in locations that leave your program in an inconsistent state. See stackoverflow.com/questions/8728866/…
  • Stephen C
    Stephen C over 11 years
    @Yishai - a better answer would be to say that it is not the responsibility of the framework to cope with dingbat apps, developers, etc that cannot manage the app's demands on memory.
  • SpaceTrucker
    SpaceTrucker almost 11 years
    For your edit 2 I would tend to say that lock.wait(); is the problem, not the OOME itself, because the same behaviour could be caused by any runtime exception.
  • Stephen C
    Stephen C almost 11 years
    @SpaceTrucker - The difference is that other exceptions don't happen spontaneously. They happen as a result of either a bug, or some condition that is predictable at some level. OTOH, OOME and other Errors can happen spontaneously! Where/when you run out of memory cannot be predicted ... in practice. But it really doesn't matter who or what is "at fault". The issue is that this kind of thing makes recovery from OOMEs problematic.
  • killjoy
    killjoy about 7 years
    I would recommend catching it high on the stack. i.e. at the beginning of the task. That way, it can try to gc any data which would have been "in scope" at the time it ran out of memory.
  • TomCZ
    TomCZ over 2 years
    I suppose one can carefully design and test an app to keep the state consistent after an OOM. But if you use any third party libs then you never know. Check out this one: issues.apache.org/jira/browse/HTTPCLIENT-2039
  • Stephen C
    Stephen C over 2 years
    Yup. That would be the kind of problem I alluded to when I said: " ... especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded.".