Memory leak, caused by use of java.util.concurrent?

Hi there,

We’'ve had some significant memory leaks on our Wildfire/Openfire deployment, since upgrading to v3.2.3. Memory usage is initially stable as a rock. After some, as yet unknown, trigger, memory usage in the Old Gen climbs linearly, until the point where about 70% of Old Gen is used. At this point the JVM triggers a full garbage collect. Eventually this flatlines the memory usage (nothing is collected), and this freezes the Wildfire process for all intents and purposes, because it keeps doing triggered garbage collects. This has happened about three times now in a week, with a restart the only option.

I’'ll attempt to recap what we have so far (rather extensive):

A heap dump made at the time of this incident shows that well over half the space reserved for Old Gen is filled with objects of type java.util.concurrent.locks.AbstractQueuedSynchronizer$Node (about 400+MB, spread over 12.3mln instances), which appears equal to:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6457125

Speculation:

Eventually we found an instance of a subclass of AbstractQueuedSynchronizer, java.util.concurrent.Semaphore$NonfairSync. This object is a queue, and a fairly long one (few minutes of clicking through to ‘‘next’’ yielded no end)

There were about 20 Thread instances who had this object in their ‘‘parkBlocker’’ field. These threads were loaded by an object:

org.mortbay.jetty.webapp.WebAppClassLoader

Which has the basic Jive class loader as its parent.

The queue is also referenced from a java.util.concurrent.Semaphore object (in its ‘‘sync’’ field), which is referenced by org.jivesoftware.wildfire.http.HttpSession, field ‘‘packetsToSendSemaphore’’.

Mind you, the link to HTTP binding is speculation, since we cannot verify at this point that this Semaphore$NonfairSync has a significant portion of the 12.3 million AbstractQueuedSynchronizer$Node’'s (spidering a web-based jhat report with an index page that is 1.6GB in size maxed out our quad core Xeon 8GB memory machine :-). But it seems a likely candidate.

Has anyone else encountered related problems? Has anything significant changed in HTTP Binding with regards to the java.util.concurrent package, or with another part of Wildfire, that I couldn’'t find? We will of course continue to check into this ourselves.

Frank van Schie

Hi,

which Java version are you using?

LG

We recently added in the semaphore code to prevent some deadlocks that were occurring. I will investigate if somehow the semaphores are not being released.

Thanks,

Alex

Hi Alex,

It seems that the issue is caused by the fix for JM-1001 (SVN rev 7553). Our problem appears right after we upgraded to 3.2.3 (which included that fix). The objects Frank mentioned are all attributes (of attributes of attributes) of the ‘‘packetsToSendSemaphore’’ attribute in a HttpSession object.

We haven’‘t been able to count the exact number of AbstractQueuedSynchronizer$Node instances that are linked to that particular Semaphore object (debugging tools tend to run forever on our entire stackdump ), but as there are a lot more objects than we would expect, there’'s most probably something wrong there.

Hey guys,

I have filed JM-1022 and checked in a possible fix just now. If one of you could test the fix and let me know if you are still experiencing the issue, that would be most helpful. I have been unable to reproduce the state locally.

Thanks,

Alex

Hi Alex,

We’‘re working on a more predictable reproduction. I’‘d like to test out this stuff before I’'m throwing it into production (which is currently rolled back to 3.2.2).

I’'ll let you know.

Hi Alex,

I added a comment to the Jira issue.

To add to the issue:

The Sun bug that I mention in my JM-1022 comment is closed, marked ‘‘fixed’’.

If you read the Sun bugreport, there’'s a comment that not all problems seem to be solved by that fix - in particular tryAquire() calls keep causing issues. The current Openfire implementation uses tryAquire().

Furthermore, the fix that was added to close this issue did not seem to be included in the JVM (build 1.6.0-b105) that I downloaded from Sun last week. Running the first test that’'s included in the bugreport fails for me, and the diffs that are not applied to the sources that came with my JDK.

In short: the Semaphore fix doesn’‘t fix everything, and doesn’'t seem to be included in current releases of Java 1.6.

The bug is actually marked as fixed for 1.6.0 update 2, which is in beta now:

http://download.java.net/jdk6/binaries/

-Matt

As the problem only occurs in production, we’‘re not comfortable running a beta JVM that includes a questionable fix. Instead, we’‘d ideally like to see the entire Semaphore mechanism to be removed. As the same functionality is implementable by ordinary Java synchronization, I’‘d opt for that. I think it’'s less complex anyway.

What struck me as odd was that The HttpSessionManager (through the work thread), first polls the session for remaining packets (doing tricky semaphore stuff here to make sure only one consumer is active at a time). After that, the HttpSessionManager creates a router object based on that same Session (from which it just extracted data), which in the end is used to send any queued packets (again, the stuff from the Session).

It seemed more logical to me to make the HttpSession itself responsible for routing any queued packets. By doing that, you can remove the method that queries the Session for pending packets and the method that releases the lock that’'s set by that operation. Instead, one new ‘‘sendPendingPackets’’ method is added, that simply sends any remaining packets itself. By making that method synchronized, we make sure that no two worker threads act on that specific queue simultaneously. Separate Sessions can still be handled concurrently though.

I’'ve included our suggested modifications in the attachment. Please note that their are based on the 3.2.3 tag in SVN.

I’‘ve tried to change as little as possible to other code that isn’‘t directly related. I think there’‘s a couple of other modifications that could be considered, but that’‘s something for a possible step 2. I’'ve made some comments in the code that reflect some of these changes.

This is my proposal for the Runnable in HttpSessionManager. I’'ve commented out the original code.

/** * A runner that gurantees that the packets per a session will be sent and * processed in the order in which they were received. */
private class HttpPacketSender implements Runnable
{
     private HttpSession session;      HttpPacketSender(HttpSession session)
     {
          this.session = session;
     }      public void run()
     {
//               Collection<Element> elements = null;
//               try
//               {
//                    elements = session.getPacketsToSend(20, TimeUnit.MILLISECONDS);
//               }
//               catch (Throwable t)
//               {
//                    /** Do nothing * */
//               }
//               if (elements == null)
//               {
//                    this.init();
//                    return;
//               }
//
//               SessionPacketRouter router = new SessionPacketRouter(session);
//
//               try
//               {
//                    for (Element packet : elements)
//                    {
//                         try
//                         {
//                              router.route(packet);
//                         }
//                         catch (UnsupportedEncodingException e)
//                         {
//                              Log.error(
//                                   "Client provided unsupported encoding type in auth request",
//                                   e);
//                         }
//                         catch (UnknownStanzaException e)
//                         {
//                              Log.error("Client provided unknown packet type", e);
//                         }
//                    }
//               }
//               finally
//               {
//                    session.releasePacketsToSend();
//               }
          session.sendPendingPackets();
     }      /*
      * TODO: the inital call to init() if elements == null seems to encourage never-ending loops. I removed
      * it in my alternative, but we should check if this is safe. If it is not, we can easily adjust      
      * HttpSession.sendPendingPackets() to return a boolean value instead.
      */
     private void init()
     {
          sendPacketPool.execute(this);
     }
}

Next, the addition I made to HttpSession. Note that much of this code is directly taken from the code that I commented out in the previous listing.

/** * This methods sends any pending packets in the session. If no packets are * pending, this method simply returns. The method is declared synchronized * to avoid simultanious sending operations on this Session object. If two * threads try to run this method simultaniously, the first one will trigger * the pending packets to be sent, while the second one will simply return * (as there are no packets left to send). */
protected synchronized void sendPendingPackets()
{
     if (packetsToSend.size() <= 0)
     {
          return;
     }
     final Collection<Element> elements = packetsToSend.remove();      /*
      * TODO: This is a result of copy/pasting the code from
      * HttpSessionManager. Can we make this SessionPacketRouter a class
      * attribute? That would save us from having to recreate it for each
      * batch of packets that''s being sent.
      */
     final SessionPacketRouter router = new SessionPacketRouter(this);      for (Element packet : elements)
     {
          try
          {
               router.route(packet);
          }
          catch (UnsupportedEncodingException e)
          {
               Log.error(
                    "Client provided unsupported encoding type in auth request",
                    e);
          }
          catch (UnknownStanzaException e)
          {
               Log.error("Client provided unknown packet type", e);
          }
     }
}

Note that the getPacketsToSend() and releasePacketsToSend() methods in HttpSession are no longer used.

We’‘ve successfully deployed an Openfire patched like this to our test domain. We haven’'t tried stress or load tests, but normal client behavior (using http binding) seems to be ok.

Guus,

Changed the patch slightly… check it out:

http://www.igniterealtime.org/fisheye/changelog/svn-org/?cs=7985

Thanks,

Alex

That looks fine to me. I’‘ve adjusted my local patch to match yours. We’‘re probably deploying it on our production server early next week. I’'ll let you know if we run into something. Thanks for your help!

On a side note: will there be a new tag in the 3.2.x tree for this?

Guus,

Let me know how your tests go in your local environment and then we will tag and release from the 3.2.x branch.

Thanks,

Alex