Wildfire 3.2.2 Crashes

Since upgrading to 3.2.2 I've had two crashes. One was midday with ~1100 users connected…

  # An unexpected error has been detected by Java Runtime Environment:

  # SIGSEGV (0xb) at pc=0x061880b6, pid=17118, tid=2002082736

The second was Saturday morning with ~350 users connected and relatively low usage…

  # An unexpected error has been detected by Java Runtime Environment:

  # java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate. Out of swap space?

  # Internal Error (414C4C4F434154494F4E0E4350500065), pid=10983, tid=2004179888

I've got a ton of memory to burn on this server, so I've increased my wildfire.vmoptions memory setting to max out at 1.5 GB instead of 1 GB; however, I don't think this is going to solve the problem.
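
For reference, a minimal sketch of what that part of wildfire.vmoptions might look like after such a change, one JVM option per line (the initial heap size here is just an illustrative assumption; only the 1.5 GB maximum reflects the change described above):

    -Xms512m
    -Xmx1536m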

Does anyone have any suggestions for more intelligent troubleshooting of this problem?

Hi,

These are Java crashes, so using the latest Java version may help.

You could also set some of the other parameters mentioned in http://wiki.igniterealtime.org/display/WILDFIRE/LinuxInstallationGuide#LinuxInstallationGuide-JVMSettings, such as ThreadStackSize, PrintGCDetails and preferIPv4Stack.

These should save some memory and give you a more detailed picture of the Java heap usage.
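
As a rough sketch (the values are illustrative, not tuned recommendations), those parameters go into wildfire.vmoptions as ordinary JVM options, one per line, next to the heap settings:

    -XX:ThreadStackSize=128
    -XX:+PrintGCDetails
    -Djava.net.preferIPv4Stack=true

ThreadStackSize is given in kilobytes, so smaller values trim the per-thread stack footprint, and PrintGCDetails writes garbage-collection activity to the server's stdout/stderr log, which is where the more detailed heap-usage picture comes from.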

LG

We've had a few other bug reports like this, so we're looking into some possible causes. More soon…

Thanks,

Matt

More random (and possibly useless) feedback.

This morning we had another problem. It happened just before I got into work, and my teammates beat on the host a bit before I could get my hands on it. Apparently people couldn't log in, so they rebooted the box. There wasn't a crash that I can tell. After restarting the service several times we could log in to the admin UI, but we couldn't log in with Spark (same accounts). We use Active Directory for authentication.

Since the service was hard down, I pulled the trigger on my failover procedure and tried failing over to my standby host. That box has an rsync of my entire /opt/wildfire directory, and my failover procedure is basically the upgrade procedure posted here. After the "failover", users are able to log in fine to the standby box. I'm tempted to say something was wrong with the JVM on the other host, but if you guys are saying you are seeing similar problems, that's a bummer. Seems like every time I'm about to declare Wildfire production-ready around here I run into a new issue.

If there is anything I can do to provide better feedback please let me know.

Thanks.

Oops, I've discovered why no one could log in: one of our guys was confused by my troubleshooting documentation and started a cmanagerd on the primary Wildfire server.

I'd like to just retract my previous post, because I can't find a crash dump for this morning and my metrics don't show an outage until these guys started messing with it. I think someone was just confused. I'll be polishing my docs this morning.

Hey St0nkingByte,

After doing some research I found that an IF statement in the Openfire code was reversed. The consequence is that Openfire is using direct buffers by default instead of heap buffers. As a temporary workaround until Openfire 3.2.3 is released, you can set the system property xmpp.socket.directBuffer to true so that Openfire will use heap buffers.
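
To illustrate the kind of bug being described (this is a simplified sketch, not the actual Openfire source; the class and method names are made up), a reversed condition on the xmpp.socket.directBuffer property would make the default buffer type come out backwards:

    import java.nio.ByteBuffer;

    public class BufferChoiceSketch {

        // Intended behaviour: directBuffer == true  -> direct (off-heap) buffer
        //                     directBuffer == false -> heap buffer (the default)
        static ByteBuffer allocate(int size) {
            boolean direct = Boolean.getBoolean("xmpp.socket.directBuffer");

            // Buggy version: the branches are swapped, so with the property
            // unset (false) the server ends up on direct buffers.
            return direct ? ByteBuffer.allocate(size) : ByteBuffer.allocateDirect(size);

            // Fixed version would read:
            // return direct ? ByteBuffer.allocateDirect(size) : ByteBuffer.allocate(size);
        }

        public static void main(String[] args) {
            System.out.println("direct buffer? " + allocate(4096).isDirect());
        }
    }

With the swapped check, setting xmpp.socket.directBuffer to true paradoxically selects heap buffers, which is why the workaround above helps until the condition itself is fixed in 3.2.3.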

Let me know if you are still seeing the OutOfMemory errors after making this change.

Regards,

– Gato

Good news.

Will this change take effect on the fly or do we need a restart?

Hey St0nkingByte,

You will need to restart the server.

FYI, this change is not going to be enough for a server under really heavy load (i.e. the incoming rate is much faster than the processing rate and the situation lasts for a long time). Even though you will not get an OOM error like the one you were seeing, the server may eventually appear to hang. The reason is that Openfire queues up packets that are waiting to be processed, and when the queue gets really big (e.g. 80K packets) the garbage collector will try to make room for new objects but the queue is so big that it will fail.

We are now working on a throttle mechanism that will slow down the incoming and outgoing read/write speed so that the queues never grow large enough to eat up all the memory.
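
As a rough illustration of the idea (not the actual Openfire implementation; the names and the capacity are made up), a bounded queue provides this kind of throttling: once the queue is full, the reading side blocks instead of letting the backlog grow without limit.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThrottleSketch {

        // Illustrative capacity, not a tuned value.
        private static final int MAX_PENDING_PACKETS = 10000;

        private final BlockingQueue<String> pending =
                new ArrayBlockingQueue<String>(MAX_PENDING_PACKETS);

        // Reader side: called for each incoming packet. If processing falls
        // behind and the queue fills up, put() blocks, slowing the read rate
        // instead of letting memory grow without bound.
        public void enqueue(String packet) throws InterruptedException {
            pending.put(packet);
        }

        // Processing side: drains the queue at whatever rate it can sustain.
        public void processLoop() throws InterruptedException {
            while (true) {
                handle(pending.take());
            }
        }

        private void handle(String packet) {
            // Packet processing would go here.
        }
    }

Blocking the reader like this is essentially the throttle being described: the incoming rate can never outrun the processing rate by more than the queue's capacity.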

Regards,

– Gato