I have gone through all posts on the forum and googled any combination of keywords to figure this one out, but no luck so far.
We have OpenFire 3.6.4 running on a Linux (CentOS 5.2) VM.
We are using AD for user authentication and group sharing.
We have about 90-100 clients connected at a time (so far).
Clients are Pidgin 2.5.9 and Adium 1.3.6, and possibly others such as Meebo.
At random times all clients get a ‘Ping Timeout’ message and appear to disconnect.
The OpenFire web admin console keeps working fine and still shows the users as connected.
The logs show nothing indicating a problem, except the occasional:
… RoutingTableImpl: Failed to route packet to JID …
which seems to appear throughout the debug log even when all clients are fine.
If we don’t do anything, OpenFire starts responding again after a couple of minutes (up to 5) and the clients reconnect.
To minimize the outage I usually run ‘service openfire stop’, but that does not actually stop the server, so I have to ‘kill -9’ the process and then run ‘service openfire start’, after which all clients connect again.
We initially ran the server on the internal network only, but moved it into the DMZ 2 months ago. That appears to have made the issue more frequent, although we have had random disconnects all along.
The frequency ranges from once a week to once or twice a day.
We have played around with lots of JVM options to see if that changes anything, but so far they have not made much of a difference.
We have also set ‘xmpp.server.session.idle = -1’, which seems to have stopped earlier issues with individual clients dropping and reconnecting, but not the ‘everyone disconnects/gets no response from OpenFire’ issue.
One suggestion on this forum was that large LDAP groups and the syncing of them could be causing this.
We have a group of 170 users with contact sharing and are now considering syncing those users to the database and turning off group sharing.
But 170 does not seem like a big number of users, considering that there must be bigger companies doing the same with many more users in a group!?
Also, the server itself is fine during that time: memory, CPU, and I/O are all close to idle, except when all clients reconnect at the same time, which produces a spike.
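To pin down exactly when the XMPP port stops answering while the admin console stays up, a small probe run from another machine could timestamp each outage so it can be lined up against the logs. The following is only a sketch in Python, assuming the standard XMPP client port 5222 and a placeholder domain `example.com`. The function name `probe_xmpp` is made up for illustration, and it only checks that the server sends something back after the stream header, not that a login would succeed.

```python
import socket
import time

# Opening stream header as sent by an XMPP client; {domain} is a placeholder.
STREAM_HEADER = (
    "<?xml version='1.0'?>"
    "<stream:stream to='{domain}' xmlns='jabber:client' "
    "xmlns:stream='http://etherx.jabber.org/streams' version='1.0'>"
)


def probe_xmpp(host, port=5222, domain="example.com", timeout=10.0):
    """Return the seconds taken for the server to answer the stream header.

    Returns None if the connection fails, times out, or is closed
    without any data being sent back.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(STREAM_HEADER.format(domain=domain).encode("utf-8"))
            data = sock.recv(4096)  # any reply at all counts as "responsive"
    except OSError:  # covers refused connections and socket timeouts
        return None
    if not data:
        return None
    return time.monotonic() - start
```

Calling this once a minute from cron or a loop and logging the timestamp whenever it returns None would give the start and end of each unresponsive period, independent of what the clients report.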
Is there anything else we might be missing?
Any option to get (even) more verbose logging?
Any way to catch what OpenFire might be doing during that time of unresponsiveness, when a normal ‘stop’ won’t stop the server and a kill is necessary?
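One thing that could be tried for this last question: a HotSpot JVM prints a thread dump to its standard output when it receives SIGQUIT (the same as ‘kill -3’) and keeps running afterwards. A few dumps taken some seconds apart during a hang should show what the threads are waiting on. Below is a minimal sketch, assuming Python is available on the server and that OpenFire's stdout is redirected to a file that can be read afterwards (where that file ends up depends on the init script, so that part is an assumption). `request_thread_dumps` is a hypothetical helper name.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval=5.0,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT to the JVM `count` times, `interval` seconds apart.

    The JVM writes each dump to its own stdout; this function only
    triggers them. Must run as root or as the user owning the process,
    otherwise os.kill raises PermissionError.
    Returns the number of signals sent.
    """
    sent = 0
    for i in range(count):
        send(pid, signal.SIGQUIT)  # equivalent to `kill -3 <pid>`
        sent += 1
        if i < count - 1:
            sleep(interval)  # space the dumps out so they can be compared
    return sent


# Example (the PID is a placeholder for the OpenFire java process):
# request_thread_dumps(12345, count=3, interval=5.0)
```

Threads that sit in the same stack frame across all dumps (for example in an LDAP or database call) would be the first place to look. This has to be done before the ‘kill -9’, since that destroys the process state.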
Thanks for any suggestions/help!