Big memory problem with Connection Manager 3.4.1: java.lang.OutOfMemoryError: Java heap space

Good morning:

We have been suffering some memory problems with the connection manager, up to the point that

we cannot use it for more than some hours.

For some hours, the heap memory gets gc, but it keeps growing.

At some point we get error messages of type java.lang.OutOfMemoryError: Java heap space.

Then it’s not possible to connect to the CM anymore.

I have been taking jmap heap dumps and analysing them.

Maybe if you see the analysis, somebody can give a clue about what to

look for in the code and try to get a fix.

I am somewhat surprised than nobody has found this problem. Maybe it’s

because of the clients we use, we use some some custom made flash

clients that are starting sessions and probably not ending them

properly. But i think the server should end this sessions sometime and

free resources.

These are the options for the java VM.

JVM_SETTINGS="-Xmx512m -Xms512m -Xss80k
-XX:+HeapDumpOnOutOfMemoryError -XX:GCTimeRatio=19 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/opt/connection_manager/logs/garbage-collector-`date'+%F-%H-%M-%S'`.log"

The java version is:

java version “1.6.0_03”

Java™ SE Runtime Environment (build 1.6.0_03-b05)

running in fedora linux.

This is the histogram at the end of the test:

Class                                                          Instance Count   Total Size
class [B                                                                 114705  458730375
class java.nio.HeapByteBuffer                                            111653    3907855
class org.apache.mina.common.support.DefaultWriteFuture                  111648    2344608
class org.apache.mina.common.SimpleByteBufferAllocator$SimpleByteBuffer  111649    1451437
class org.apache.mina.common.IoFilter$WriteRequest                       111648    1339776
class [C                                                                   9127    1058136
class java.util.concurrent.ConcurrentLinkedQueue$Node                    112071     896568
class java.util.concurrent.atomic.AtomicInteger                          112328     449312
class [I                                                                   2441     229152
class [S                                                                   2555     148932
class java.lang.Class                                                      1913     145388
class java.lang.String                                                     7318     117088

It looks that the HeapByteBuffer are not freed properly, and WriteRequests get stuck, or something like that.

I have attached a file showing how the instance count evolves with

time, and the log of the garbage collector.

I hope that somebody can give a clue about which part of the code should i look into,

because i haven’t found where to start.

¡¡Thanks a lot for openfire, it has worked perfectly up to now!!

– Gaizka

I see that one of the abundantly available instances is the node object of java.util.concurrent.ConcurrentLinkedQueue, while the number of Queue instances is low enough to not be included in your dump. This might indicate that your problem can be debugged the same way I debugged a problem that I once had while using linked lists (documented in YMSG-12).

Basically, what I had (and what you may have), is one (or very few) linked lists of objects where one object links to the next object, which links to the next object, and so on, and so on. It could be very interesting to see what links to that first Node object.

I have been using OQL queries in the JHat application (which you can use to analyze JMap dumps) to find this ‘first’ nodes (you should replace the hex value for the value of your java.util.concurrent.ConcurrentLinkedQueue$Node class):

select n from instanceof 0x36a810e8 n where contains(referees(n), "classof(it).name == 'java.util.concurrent.ConcurrentLinkedQueue$Node'") && !contains(referrers(n), "classof(it).name == 'java.util.concurrent.ConcurrentLinkedQueue$Node'")

This query basically tells JHat to search for all Nodes that do not have another Node as a referrer. Note that queries like these can take up to a couple of hours to run, depending on the size of your memory dump and the resources available on the computer running JHat.

Hi!

First of all, thanks a lot for your attention!!

I have run that query (select n from instanceof 0x36a810e8 n where contains…)

and i have got 154 instances of ConcurrentLinkedQueue$Node with no Node as referrer.

I have also run the query of YMSG-12 (http://www.igniterealtime.org/issues/browse/YMSG-12),

for each one of this instances

select map(heap.findObject(0x2fb5dd88), function (it) { var res = 0;
    while (it != null) { res ++; it = it.next; } return res; }

I have attached the plot of their length, mainly it’ around 900 nodes.

Now, as you ask, i have found what is referencing this 154 Node with no Node as referer.

They are 154 ConcurrentLinkedQueue, which in turn are referenced by:

4    org.apache.mina.transport.socket.nio.SocketIoProcessor
150  org.apache.mina.transport.socket.nio.SocketSessionImpl

Each of this SocketSessionImpl is referenced by around 900

org.apache.mina.common.support.DefaultWriteFuture

I hope that you can get some meaning out of this.

I have read in YMSG-12 that you did "The queue from the Java

concurrent package was replaced by a synchronized linked list."

I am willing to try that, if you can enlight me a little.

I’ll try changing (in file ConnectionManager.java) the lines:

Executor ioExecutor = new ThreadPoolExecutor(
         ioThreads + 1, ioThreads + 1, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>() );

Thanks a lot for your time!!

– Gaizka

Note that OpenYMSG is a project that doesn’t have that much in common with your current problem. I was merely attempting to describe a way to debug your problem using my attempt for a particular (but totally unrelated) problem I had in OpenYMSG. Blindly applying my fix in OpenYMSG to the ConnectionManager code will probably not get you anywhere.

I just now noticed your comment on ‘not closing custom Flash clients’ properly. Could you elaborate on that a little? Do you, for instance, see an equal number of client connections as the amount of SocketSessionImpl instances (around 150 in your last example)? Do you expect that many concurrent client connections? If not, could you check on the operating system level if the amount of open sockets matches your expectations (using lsof, for example)?

Your comment that you were somewhat surprised that no-one else encountered the problem, could be close to the cause of your problems: something specific to your environment. Those clients you mentioned would be a very likely candidate. That would also mean that changing the code of Openfires Connection Managers isn’t the best place to fix your problem.

Would it be totally out of the question for you to adjust the clients to close their connection properly after they are done/used/quit? This is rather easily done by having the clients send a closing stream tag ( </stream:stream> or </flash:stream> depending on your setup) that matches the tag that you used to initially start the conversation with the XMPP domain, and closing any socket that they use to connect to the domain to.

It might be worth noting that we’re completly on the wrong track if that 460 mbs of byte arrays in your original memory dump isn’t spread equally over those 150-ish connections. After all, the instance count of those byte arrays resemble, but do not match, the instance counts of the other objects. Maybe it’s worth looking into those referres further.

I have been checking the program we use to stress test the server.

It turns out that there was a problem with it, there was a lot of sessions which were sending a message to a disconnected user.

These messages generate a message been sent again to the client.

The client was not reading all of these messages, and (i guess) they were being placed in a queue.

When the memory usage (in conn. manager) raises and raises, i have checked (with netstat) that there are lots of connection with Send-Q (The count of bytes not acknowledged by the remote host.)

So i guess that the problem is with that. Probably our flash clients make something like that (but much slower), i have to check that again.

I know it probably is a problem that we have to check, but, i guess there’s a problem also with the connection manager code. Since i can make a DOS, by just:

  • connecting as a regular user

  • start writing messages to unavailable or inexistent user

  • not read the error response packets

Note that this only happens with the CM, not by connecting to the openfire server directly.

Also, i have been changing the code (for example, using PooledByteBufferAllocator instead of SimpleByteBufferAllocator in ConnectionManager.java) but i haven’t been able to fix it. I’ve also tried getting checking the IOFuture return of IOSession.write , but then performance drops hugely (since it’s then almost synchronous behavior)

Maybe you can give me some tips about what else to try. I’d be really grateful.

So, in the end, i don’t know if this should be considered as a Connection Manager bug or not.

Thanks a lot for your attention

Update:

I have changed the code of the stress-bot, to read every packet that is sent by the server.

Supposedly, it would fix the problem above. Well, memory usage (albeit slower at the beginning) starts to raise again

Now the network Send-Q is empty, but i am getting again lots of java.util.concurrent.LinkedBlockingQueue$Nodes.

I can send you the code (it’s a Ruby script) so you can reproduce this problem.

But it only does this:

  • connect and auth to the server

  • starts writing messages to a disconnected user in it’s roster

  • ready any incoming message from server

It does it for 300 sessions connecting as the same user with different resources. Also, i block presences between these users with the packetfilter (this could cause cpu problems to the openfire server, but not anything to the connection manager, i guess).

Thanks!!

Hi, can you help me? I met this issue also, I use the connection manager 3.6.3(the latest), there are two connection manager deployed in one machine, connect to the openfire(the same machine), it run 100% cpu and huge memory even I set the xms:256m xmx:256 for the jvm setting, I’m not sure why, there are only several user connect to connection manager, and less than one hour, the issue occur, it is always reproducible, any idea about this issue?