Openfire out of memory with zombie sessions

We’re using the Openfire 3.7.1 code base, and what we are encountering is that Openfire runs out of memory because HTTP sessions are not getting cleaned up. Examining the heap dump, we found about 1 GB of HttpSession objects, where the majority of the memory is taken up by Deliverable objects filling up the pendingElements ArrayList, which holds events that should be published to that HTTP session and user but cannot be delivered.
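
For context, the growth pattern looks roughly like this (a hedged sketch with simplified names, not the actual HttpSession source): every stanza that cannot be flushed to the client is parked in an in-memory pending list, and nothing bounds that list except the session eventually being closed.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of why memory grows on a session that never closes (illustrative
// only; names simplified from what we see in the dump).
class PendingElementsSketch {
    static class Deliverable {
        final String stanzaXml;
        Deliverable(String stanzaXml) { this.stanzaXml = stanzaXml; }
    }

    private final List<Deliverable> pendingElements = new ArrayList<>();

    // Called when a stanza cannot be delivered over the current HTTP connection.
    void enqueue(Deliverable d) {
        pendingElements.add(d);  // unbounded: a never-closed session keeps everything
    }
}
```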

Upon further inspection, the HttpSession with the largest share of memory had a lastActivity value from about two weeks prior to the OOM, which is far longer than the 30-second inactivity timeout that should cause the HttpSessionReaper to clean up that session.
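
To spell out the behaviour we are relying on, here is a minimal sketch (not the real Openfire source; names and types simplified) of the periodic check we expect the reaper to perform. A session whose lastActivity is two weeks old should be closed on the very next pass.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the reaper's expected behaviour, based on the
// 30-second inactivity timeout in our configuration.
class ReaperSketch {
    static class Session {
        volatile long lastActivity;          // millis of the last client poll
        void close() { /* tear down the connection, drop pendingElements */ }
    }

    private final Map<String, Session> sessionMap = new ConcurrentHashMap<>();
    private final long inactivityTimeoutMillis = 30_000;  // 30s

    // Run periodically (every 30 seconds) by a scheduled task.
    void reap() {
        long now = System.currentTimeMillis();
        for (Session session : sessionMap.values()) {
            if (now - session.lastActivity > inactivityTimeoutMillis) {
                session.close();  // an idle session should never survive this pass
            }
        }
    }
}
```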

Has anybody encountered such behavior where zombie sessions are not getting cleaned up?

Thanks,

Conrad

Have you tried the current trunk, soon to be 3.8.1? There was work done on HTTP sessions that may help with this.

daryl

Haven’t seen that, but may I ask how you tracked down these zombie sessions? Can you briefly describe what you did and what tools you used?

We actually got a heap dump from a customer site after an OOM crash, and it showed that there were a ton (182) of HttpSessions that had far exceeded the inactivity timeout.

What’s also of interest is that the HttpSession’s isClosed flag is still set to false and the VirtualConnection status is not -1 (CLOSED), even though these are clearly among the first things set by the close() method that the HttpSessionReaper calls.
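
To make the point concrete, this is roughly the ordering we mean (a simplified sketch, not the actual 3.7.1 implementation): because these flags are flipped at the top of close(), any session the reaper actually reached should show isClosed == true in the dump.

```java
// Hedged sketch of the close() ordering described above (illustrative only).
class HttpSessionCloseSketch {
    static final int CLOSED = -1;

    private volatile boolean isClosed = false;
    private volatile int connectionStatus = 0;  // VirtualConnection-style status

    synchronized void close() {
        if (isClosed) {
            return;
        }
        isClosed = true;            // set first...
        connectionStatus = CLOSED;  // ...before any of the remaining teardown
        // ... remainder of teardown (notify listeners, clear pendingElements, etc.)
    }
}
```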

This leads us to believe that either the Reaper for some reason stops running periodically (every 30s), or it keeps hitting some sort of RuntimeException early in its execution (e.g. a really corrupted session in the sessionMap), preventing it from getting to the rest of the sessions.
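
The distinction we are drawing is roughly this (illustrative sketch with simplified types, not the actual reaper code): if the whole pass shares one failure domain, a single corrupted session stops every session after it from being reaped, whereas a per-session guard would contain the damage.

```java
import java.util.Collection;

// Sketch of the two failure modes we are hypothesizing about (not real code).
class ReaperFailureSketch {
    interface Session { void close(); }

    // Fragile: a RuntimeException from one close() aborts the rest of the pass.
    static void reapFragile(Collection<Session> sessions) {
        for (Session s : sessions) {
            s.close();
        }
    }

    // Defensive: guard each session so one bad entry cannot shield the others.
    // This is the behaviour we would want, not necessarily what 3.7.1 does.
    static void reapDefensive(Collection<Session> sessions) {
        for (Session s : sessions) {
            try {
                s.close();
            } catch (RuntimeException e) {
                System.err.println("Failed to close session, continuing: " + e);
            }
        }
    }
}
```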

However, there are try/catch blocks within the Reaper’s session.close() call stack that limit (but don’t entirely preclude) the possibility of an uncaught RuntimeException.

Any help from anyone more familiar with this code is certainly appreciated.

We were also lucky enough to catch a thread overview in our heap dump showing a pool-openfire thread with the Reaper in action and not deadlocked (it was in the middle of appending a string for a JID), so we have good reason to believe that the Reaper has not died and is still running every 30 seconds, but for some reason fails to clean up the rest of these “zombie” sessions.