All clients get 'Ping Timeouts'

I have gone through all posts on the forum and googled any combination of keywords to figure this one out, but no luck so far.

We have OpenFire 3.6.4 running on a Linux (CentOS 5.2) VM.

We are using AD for user authentication and group sharing.

We have about 90-100 clients connected at a time (so far).

Clients are Pidgin 2.5.9 and Adium 1.3.6, and possibly others such as Meebo.

At random times all clients get a ‘Ping Timeout’ message and appear to disconnect.

OpenFire web admin is still working fine and the users still show up as being connected.

The logs show nothing indicating a problem, except the occasional:

… RoutingTableImpl: Failed to route packet to JID …

which seems to appear throughout the debug log even when all clients are fine.

If we don’t do anything, OpenFire starts responding again after a few minutes (up to 5) and the clients reconnect.

To minimize the outage I usually run ‘service openfire stop’, but that does not actually stop the server. I then have to ‘kill -9’ the process and run ‘service openfire start’, after which all clients connect again.
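
For reference, the manual sequence above could be scripted roughly as follows. This is only a sketch in Python 3 (an assumption, since the CentOS 5.2 box ships an older Python). The PID is whatever ‘ps’ reports for the OpenFire Java process, and the function names are made up for illustration.

```python
import os
import signal
import subprocess
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout=30.0, poll=0.5):
    """Poll until the process is gone; return False if it outlives the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll)
    return not pid_alive(pid)


def restart_openfire(pid, timeout=30.0):
    """Try a clean stop first, fall back to SIGKILL, then start again."""
    subprocess.call(["service", "openfire", "stop"])
    if not wait_for_exit(pid, timeout):
        try:
            os.kill(pid, signal.SIGKILL)  # same as 'kill -9 <pid>'
        except ProcessLookupError:
            pass  # it exited on its own in the meantime
        wait_for_exit(pid, 5.0)
    subprocess.call(["service", "openfire", "start"])
```

It would be run as root, e.g. restart_openfire(pid), with the PID looked up by hand first.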

We initially had the server on the internal network only, but moved it into the DMZ 2 months ago. That appears to have made this issue more frequent, although we have been having random disconnects all along.

This can range from once a week to once or twice a day.

We have played around with lots of JVM options to see if that does anything, but so far it has not made much of a difference.

We have also set ‘xmpp.server.session.idle = -1’, which seemed to stop the earlier issues with individual clients dropping and reconnecting, but not the ‘everyone disconnects/gets no response from OpenFire’ issue.

One suggestion on this forum was that large LDAP groups and the syncing of them could be causing this.

We have a group of 170 users with contact sharing and are now considering syncing those users to the database and turning off group sharing.

But 170 does not seem like a large number of users, considering that there must be bigger companies doing the same with many more users in a group.

Also, the server itself is fine during that time: memory, CPU, and I/O are all close to idle, except when all clients reconnect at the same time, which produces a spike.
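
To get exact timestamps for the outages, I am thinking of running a small probe from another machine. The sketch below is Python 3, uses made-up function names, and assumes 5222 is our client port. It opens a client stream the way an XMPP client would and times the first reply from the server.

```python
import socket
import time

# Standard XMPP client stream header; the domain is filled in per probe.
STREAM_HEADER = (
    "<?xml version='1.0'?>"
    "<stream:stream to='{domain}' xmlns='jabber:client' "
    "xmlns:stream='http://etherx.jabber.org/streams' version='1.0'>"
)


def probe(host, port=5222, domain=None, timeout=10.0):
    """Open an XMPP client stream and time the server's first reply.

    Returns (ok, seconds, detail).
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            header = STREAM_HEADER.format(domain=domain or host)
            sock.sendall(header.encode("utf-8"))
            reply = sock.recv(4096)
    except OSError as exc:  # covers timeouts and refused connections
        return False, time.monotonic() - start, repr(exc)
    elapsed = time.monotonic() - start
    if not reply:
        return False, elapsed, "connection closed without a reply"
    return b"stream" in reply, elapsed, reply[:60].decode("utf-8", "replace")


def watch(host, rounds, interval=30.0):
    """Run the probe `rounds` times and print one timestamped line per attempt."""
    for _ in range(rounds):
        ok, secs, detail = probe(host)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, "OK" if ok else "FAIL", "%.2fs" % secs, detail)
        time.sleep(interval)
```

Left running with a large number of rounds, the FAIL lines should show when the server stops answering and for how long, which could then be lined up against the debug log.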

Is there anything else we might be missing?

Any option to get (even) more verbose logging?

Any way to catch what OpenFire might be doing during that time of unresponsiveness, when a normal ‘stop’ won’t stop the server and a kill is necessary?
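
The only idea I have so far is to send the JVM a SIGQUIT (‘kill -3’) a few times before killing it, since the Sun JVM prints a thread dump to its standard output on that signal. Below is a minimal sketch in Python 3 with a made-up function name. I have not yet confirmed which file our init script sends the JVM's stdout to.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval=5.0):
    """Send SIGQUIT to a JVM `count` times, `interval` seconds apart.

    Returns the number of signals actually delivered.
    """
    sent = 0
    for i in range(count):
        try:
            os.kill(pid, signal.SIGQUIT)  # same as 'kill -3 <pid>'
        except ProcessLookupError:
            break  # process is gone, nothing left to dump
        sent += 1
        if i < count - 1:
            time.sleep(interval)
    return sent
```

Several dumps a few seconds apart should show whether the same threads are stuck in the same place each time. Is that the right approach, or is there something better?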

Thanks for any suggestions/help!

Jay