I've got a ton of memory to burn on this server, so I've raised the maximum memory setting in my wildfire.vmoptions from 1 GB to 1.5 GB. However, I don't think this is going to solve the problem.
Does anyone have any suggestions for more intelligent troubleshooting of this problem?
This morning we had another problem. It happened just before I got into work, and my teammates beat on the host a bit before I could get my hands on it. Apparently people couldn't log in, so they rebooted the box. There wasn't a crash that I can tell. After restarting the service several times we could log in to the admin UI, but we couldn't log in with Spark (same accounts). We use Active Directory for authentication.
Since the service was hard down, I pulled the trigger on my failover procedure and tried failing over to my standby host. This box has an rsync of my entire /opt/wildfire directory. My failover procedure is basically the upgrade procedure posted here. After the "failover", users are able to log in fine to the standby box. I'm tempted to say something was wrong with the JVM on the other host, but if you guys are saying you are seeing similar problems, that's a bummer. Seems like every time I'm about to declare Wildfire production around here, I run into a new issue.
If there is anything I can do to provide better feedback please let me know.
Oops, I've discovered why no one could log in: one of our guys was confused by my troubleshooting documentation and started a cmanagerd on the primary Wildfire server.
I'd like to just retract my previous post, because I can't find a crash dump for this morning and my metrics don't show an outage until these guys started messing with it. I think someone was just confused. I'll be polishing my docs this morning.
After doing some research I found that an if statement in the Openfire code had its condition the wrong way around. The consequence is that Openfire uses Direct Buffers by default instead of Heap Buffers. As a temporary workaround until Openfire 3.2.3 is released, you can set the system property xmpp.socket.directBuffer to true. Because the check is inverted, setting it to true makes Openfire use Heap Buffers.
Let me know if you are still seeing the OutOfMemory errors after making this change.
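To make the inverted check concrete, here is a minimal sketch of how a flipped condition swaps direct and heap buffers. This is not the actual Openfire source: the class and method names are made up for illustration, and the boolean parameter only stands in for the value of xmpp.socket.directBuffer, assumed to default to false.

```java
import java.nio.ByteBuffer;

// Illustrative only: hypothetical names, not Openfire's real code.
public class BufferChoiceSketch {

    // Inverted check (the kind of bug described above): a direct buffer is
    // handed out when the property is false, and a heap buffer when it is true.
    public static ByteBuffer allocateInverted(boolean directBufferProperty, int size) {
        if (!directBufferProperty) {
            return ByteBuffer.allocateDirect(size); // off-heap memory
        }
        return ByteBuffer.allocate(size); // ordinary heap memory
    }

    // Intended behaviour: direct buffers only when explicitly requested.
    public static ByteBuffer allocateIntended(boolean directBufferProperty, int size) {
        if (directBufferProperty) {
            return ByteBuffer.allocateDirect(size);
        }
        return ByteBuffer.allocate(size);
    }

    public static void main(String[] args) {
        // With the assumed default (false), the inverted check yields a direct buffer.
        System.out.println("inverted, false: direct=" + allocateInverted(false, 1024).isDirect());
        // The workaround: setting the property to true yields a heap buffer.
        System.out.println("inverted, true:  direct=" + allocateInverted(true, 1024).isDirect());
        // What the fixed check would do with the default value.
        System.out.println("intended, false: direct=" + allocateIntended(false, 1024).isDirect());
    }
}
```

Direct buffers are allocated outside the Java heap, which is presumably why raising the heap limit alone may not help in this situation.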
FYI, this change is not going to be enough for a server that is under really heavy load (i.e. the incoming rate is much faster than the processing rate and this situation lasts for a long time). Even though you will not get an OOM error like the one you were seeing, the server may eventually appear to hang. The reason is that Openfire queues up packets that are waiting to be processed, and when the queue gets really big (e.g. 80K) the Garbage Collector will try to make room for new objects, but the queue is so big that it will fail.
We are now working on a throttle mechanism that will slow down the incoming and outgoing read/write speed so that queues never grow large enough to eat up all memory.
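As a rough illustration of the idea only (this is not the throttle mechanism being built for Openfire), a bounded queue is one common way to keep pending work from growing without limit. In the sketch below the class name, the capacity, and the use of String as a stand-in for a packet are all assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a generic bounded-queue sketch, not Openfire's design.
public class BoundedQueueSketch {

    private final BlockingQueue<String> pending;

    public BoundedQueueSketch(int capacity) {
        // A fixed capacity puts a hard cap on how much pending work is held in memory.
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the queue is full; a caller would then stop reading
    // from the connection for a while instead of queuing more packets.
    public boolean tryEnqueue(String packet) {
        return pending.offer(packet);
    }

    // Removes and returns the oldest pending packet, or null if none is waiting.
    public String processOne() {
        return pending.poll();
    }

    public int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedQueueSketch queue = new BoundedQueueSketch(2);
        System.out.println(queue.tryEnqueue("packet-1")); // accepted
        System.out.println(queue.tryEnqueue("packet-2")); // accepted
        System.out.println(queue.tryEnqueue("packet-3")); // rejected: queue is full
        queue.processOne();                               // processing frees a slot
        System.out.println(queue.tryEnqueue("packet-3")); // accepted now
    }
}
```

The point of the sketch is that a fast sender gets pushed back once the cap is reached, rather than the server accepting everything until the heap fills up.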