Openfire service crashing daily

Interesting - there’s a substantial spike on all graphs at the end. Can you explain this through user patterns (did everyone come into the office?), or could this indicate the point where your problem starts?

It’s hard to tell for sure (as there isn’t a full history’s worth of data yet), but it looks like the number of threads is growing out of control (see thread1.jpg). I advise you to take regular thread dumps and compare them.

HeapMem.jpg shows that the JVM is using but a fraction of the memory that you’ve made available to it. You can turn this back a notch or two, if you want. You’d expect more of a sawtooth-like pattern in this graph, which would be just fine.

I’ve found this whitepaper, in which you might find VMware-specific hints, tips and/or clues: http://www.vmware.com/resources/techresources/1087

Hi,

looking at Stanza1.jpg, I wonder whether the users are really exchanging 1500 presence packets per minute (or second), or whether this is an Openfire issue caused by a single client. The client may be hard to identify; however, “Server Settings” > “Message Audit Policy” allows one to audit presence packets and verify that these packets are fine.
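
If auditing alone does not point at the offending client, a small Openfire plugin could count presence stanzas per sender so that a single “chatty” client stands out. Below is a minimal sketch, assuming the standard Openfire plugin API (PacketInterceptor / InterceptorManager); the class name PresenceCounter and the idea of registering it from a plugin are my own illustration, not something that ships with Openfire:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    import org.jivesoftware.openfire.interceptor.InterceptorManager;
    import org.jivesoftware.openfire.interceptor.PacketInterceptor;
    import org.jivesoftware.openfire.interceptor.PacketRejectedException;
    import org.jivesoftware.openfire.session.Session;
    import org.xmpp.packet.Packet;
    import org.xmpp.packet.Presence;

    // Hypothetical helper: counts incoming presence stanzas per bare JID.
    // Register an instance from a plugin's initializePlugin() and dump the
    // counts (e.g. to the log) every few minutes to see who sends the most.
    public class PresenceCounter implements PacketInterceptor {

        private final ConcurrentHashMap<String, AtomicLong> counts =
                new ConcurrentHashMap<String, AtomicLong>();

        public void register() {
            InterceptorManager.getInstance().addInterceptor(this);
        }

        public void interceptPacket(Packet packet, Session session,
                                    boolean incoming, boolean processed)
                throws PacketRejectedException {
            // Count presence stanzas as they arrive, before Openfire processes them.
            if (incoming && !processed && packet instanceof Presence && packet.getFrom() != null) {
                String sender = packet.getFrom().toBareJID();
                AtomicLong counter = counts.get(sender);
                if (counter == null) {
                    AtomicLong fresh = new AtomicLong();
                    AtomicLong existing = counts.putIfAbsent(sender, fresh);
                    counter = (existing != null) ? existing : fresh;
                }
                counter.incrementAndGet();
            }
        }

        public Map<String, AtomicLong> snapshot() {
            return counts;
        }
    }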

LG

One of my co-workers, who works with VMware in a professional capacity, tells me that Java and VMware do play nicely together, as long as you follow the guidelines in the whitepaper I linked earlier.

Thanks for the whitepaper. Most of the guidelines are VMware best practices, which we’ve been following. The only thing I noticed that was odd is that they recommend setting a memory reservation for the amount of RAM that is allocated to the VM. This is something we will need to try.

Nothing was unusual at that time. The connection count was about 1300; we don’t exceed 1500 connections. I expect most of our users to connect between 7 AM and 9 AM EST. The crash happened at 12:38 PM EST.

I am continuing to monitor with Java-Monitor…this should help with the historical data. Could you go into more detail on how to take the thread dumps?

We are in the process of building a new 64-bit Server 2003 VM to move this to. I don’t know if it will help, but I need to try something.

Thread dumps can be taken in a variety of ways. They are explained in this article: http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
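
On Windows, where Openfire typically runs as a service without a console for Ctrl+Break, “jstack <pid>” from a JDK install is probably the easiest of those options. Alternatively, a dump can be written from inside the JVM itself; here is a minimal sketch, assuming Java 6 (dumpAllThreads needs 1.6) and meant to be called from code running inside the Openfire process, for example on a timer in a small plugin:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: writes a thread dump of the *current* JVM to a file, so it only
    // helps when invoked from within the Openfire process itself.
    public class ThreadDumper {

        public static void dumpTo(String fileName) throws IOException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            PrintWriter out = new PrintWriter(new FileWriter(fileName));
            try {
                // true, true -> include locked monitors and synchronizers.
                // Note: ThreadInfo.toString() truncates very deep stacks, but it
                // is usually enough to see what the threads are doing.
                for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                    out.print(info.toString());
                }
            } finally {
                out.close();
            }
        }
    }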

Have you considered moving Openfire to a machine that is not virtual? That would help you to rule out VMWare as a contributor to your problem.

I am posting more Java-Monitor graphs. This seems a bit odd to me. I noticed my heap memory usage drop to about 25% of what it was, the threads were cut in half, and there was a spike in Garbage2.jpg. When I first noticed this drop we had about 1270 sessions. At the time I gathered the graphs we had 1360 sessions. So the session count is rising, but my heap memory usage just plummeted.

We will also be rebuilding this system on a physical server as recommended.

Thanks for everyone’s suggestions so far…
010510.zip (169054 Bytes)

Hi,

I assume that you restarted your server at 2010.01.04-16:xx (y.m.d-H:M) and that a “Full GC” occurred at 2010.01.05-14:xx. Thread1.jpg shows that the number of “runnable” and “timed waiting” threads does not change, while the number of “waiting” threads drops, which was likely caused by the “Full GC”.

Attaching VisualVM to your Openfire process allows you to get stack traces and heap dumps, so you really may want to install it on a client machine and connect it to Openfire.
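
If connecting VisualVM from a client machine turns out to be awkward (remote use needs JMX or jstatd access to the server), a heap dump can also be triggered from inside the JVM and the resulting .hprof file opened in VisualVM afterwards. A minimal sketch, assuming a Sun Java 6 JRE; com.sun.management.HotSpotDiagnosticMXBean is HotSpot-specific, and the output path you pass in is just an example:

    import java.io.IOException;
    import java.lang.management.ManagementFactory;

    import com.sun.management.HotSpotDiagnosticMXBean;

    // Sketch: writes an .hprof heap dump of the current JVM, to be analysed
    // later in VisualVM or a similar heap analyser.
    public class HeapDumper {

        public static void dump(String fileName) throws IOException {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // true -> dump only live (reachable) objects
            diag.dumpHeap(fileName, true);
        }
    }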

LG

Just crashed again… Posting new graphs. I hope I am not overwhelming everyone with information. I just don’t want to miss anything that would be relevant.

A fatal error has been detected by the Java Runtime Environment:

java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate. Out of swap space?

Internal Error (allocation.cpp:117), pid=1296, tid=1376

Error: ChunkPool::allocate

JRE version: 6.0_17-b04

Java VM: Java HotSpot(TM) Client VM (14.3-b01 mixed mode windows-x86 )

If you would like to submit a bug report, please visit:

http://java.sun.com/webapps/bugreport/crash.jsp

--------------- T H R E A D ---------------

Current thread (0x56fc0800): VMThread [stack: 0x57050000,0x570a0000] [id=1376]

Stack: [0x57050000,0x570a0000], sp=0x5709fc58, free space=319k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x1e66b7]
V [jvm.dll+0xa0c9c]
V [jvm.dll+0x27e6]
V [jvm.dll+0x29e2]
V [jvm.dll+0x2be3]
V [jvm.dll+0x1935c3]
V [jvm.dll+0x1c7f3]
V [jvm.dll+0x1cc28]
V [jvm.dll+0x1ce41]
V [jvm.dll+0x1e852e]
V [jvm.dll+0x1e887c]
V [jvm.dll+0x1e8ca2]
V [jvm.dll+0x173e4c]
C [MSVCR71.dll+0x9565]
C [kernel32.dll+0x2482f]

VM_Operation (0x5875f2ac): BulkRevokeBias, mode: safepoint, requested by thread 0x57d21800
010510_2.zip (176727 Bytes)

Hi,

you may want to monitor the virtual memory size of your process with “pslist -m 1234”, where 1234 should be your Openfire pid. “pslist -t 1234” could also be interesting, but I think that this is a virtual memory problem.
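
As a cross-check from inside the JVM, the HotSpot-specific OperatingSystemMXBean exposes a figure roughly comparable to the VM size pslist reports (the virtual memory committed to the process). A minimal sketch, assuming a Sun Java 6 JRE; on a 32-bit Windows JVM, this number approaching the ~2 GB process address-space limit would fit the “Out of swap space?” crash above:

    import java.lang.management.ManagementFactory;

    // Sketch: logs the virtual memory committed to this JVM process. Uses the
    // HotSpot-specific com.sun.management interface, so it is not portable to
    // other JVM vendors.
    public class VmSizeLogger {

        public static void log() {
            java.lang.management.OperatingSystemMXBean os =
                    ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof com.sun.management.OperatingSystemMXBean) {
                long bytes = ((com.sun.management.OperatingSystemMXBean) os)
                        .getCommittedVirtualMemorySize();
                System.out.println("Committed virtual memory: " + (bytes / (1024 * 1024)) + " MB");
            }
        }
    }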

LG

The thread usage pattern is a bit odd, as others also mentioned. This might indicate a problem, but it might also simply be the result of objects that linger for an extremely long time, awaiting finalization. As you assigned a lot more memory to Openfire than it needs, it can take a long, long time before things are cleaned out.

I’d be interested in the thread dumps, as I’ve mentioned earlier. That might give us a clue as to what all those ‘extra’ threads are about.

For now, I’m still going with a problem that relates to the operating system itself. Any luck preparing a non-virtual host yet?

We are currently working on moving this to a physical server.

We had the memory set so high because we thought it was a lack of JVM memory that was causing our problems. That just isn’t the case. We haven’t adjusted it back yet.

I emailed you the complete dumps from when the service crashed. They list all the threads at the time the service crashed. Let me know if that is what you are looking for.

As I replied to you by mail:

This wasn’t exactly what I was looking for, but it does give me some information. At first sight, there appear to be too many Timer threads, which seem to relate to the library that implements the Yahoo Messenger protocol. Any chance that you’d be able to run for a couple of days with the Yahoo gateway disabled?

Hi,

another issue one can see in Graphs.zip / ThreadCount.png: “Total threads created: 12517”. Creating and destroying threads always leaves some memory fragments which cannot be freed, so with about 11000 terminated threads the JVM will have allocated a lot of memory which is unusable.

LG

==> It may help a lot to use a 64-bit JVM.
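
To watch the churn described above (threads created versus threads still alive), java.lang.management.ThreadMXBean exposes both counters; here is a minimal sketch that could be logged periodically alongside the Java-Monitor data:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Sketch: compares the number of threads ever started with the number still
    // alive. A large gap (e.g. 12517 started vs. roughly 1500 live) confirms
    // heavy thread creation and destruction.
    public class ThreadChurn {

        public static void report() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long started = threads.getTotalStartedThreadCount();
            int live = threads.getThreadCount();
            int peak = threads.getPeakThreadCount();
            System.out.println("started=" + started + ", live=" + live
                    + ", peak=" + peak + ", terminated=" + (started - live));
        }
    }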

I will probably jinx myself by saying this…but we have had a stable system for 62 hours. It normally wouldn’t last more than 12-14 hours.

We decided to keep the server running as a virtual machine, but rebuilt it with a 64-bit OS and a 64-bit JVM (using this executable http://www.igniterealtime.org/community/docs/DOC-1331 to have it use the 64-bit JVM). We followed the VMware whitepaper referenced in this thread, making sure to set the memory reservation to the amount of RAM allocated to the virtual server.

I have been monitoring the Java-Monitor graphs (attached) and the usage pattern has changed significantly. We have still been seeing spikes in the presence count…but it seems to recover. The last spike lasted all night, during off hours when we have the fewest connections, which is a little odd.

If everything stays stable, I would like to thank everyone that contributed in helping me out. If there are any comments on the current graphs, please feel free to share.

Thanks
011309.zip (226009 Bytes)

Good news! Thanks for the follow-up - this might help others who are experiencing (or going to experience) similar problems.