Openfire 3.9.3 dropping all sessions

Every once in a while our Openfire instance drops all connections and stops accepting new ones. The server still accepts TCP connections, but there is no reply when testing XMPP by hand with telnet or netcat. Additionally, the Session page in the admin web interface will not load; it just hangs indefinitely. We are using LDAP auth, but the LDAP server does not seem to be the problem, since I can log in to the web interface and other services can still access LDAP.
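For anyone who wants to repeat the telnet/netcat check without typing the stream header by hand, here is a minimal sketch in Python. It assumes the standard client port 5222 and a plain (non-TLS) stream opening; `probe_xmpp` is a made-up helper name, and the host and domain are placeholders to replace with your own.

```python
import socket

def probe_xmpp(host, domain, port=5222, timeout=10.0):
    """Open a TCP connection, send an XMPP stream header, and return the
    first bytes the server sends back. Returns None if the server stays
    silent for `timeout` seconds or closes the connection without data."""
    header = (
        "<?xml version='1.0'?>"
        f"<stream:stream to='{domain}' xmlns='jabber:client' "
        "xmlns:stream='http://etherx.jabber.org/streams' version='1.0'>"
    )
    # The timeout applies to the connect and to the later send/recv calls.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(header.encode("utf-8"))
        try:
            data = sock.recv(4096)
        except socket.timeout:
            return None  # TCP connected, but no XMPP reply: the symptom described above
    return data or None
```

A healthy server should answer with its own stream header almost immediately, so a `None` result while the TCP connect itself succeeds matches the hang described here. Running this from cron and logging the result would at least give exact timestamps for when the stall begins.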

There is nothing in the log that indicates a problem (debug logging is enabled now). I am at my wits’ end with this: I don’t know how to debug the problem further, and I have not found a way to trigger it on purpose. I have attached jstack output of a hung process.
jstack.out.zip (6277 Bytes)

There are various bugs in 3.9.3 that could be causing this issue, and I suspect that it is fixed in the current 3.10.0 RC release. Are you able to test it?

I have now upgraded to Openfire 3.10. The curious thing is that everything worked fine with similar settings on another server (Debian wheezy based, as opposed to the current one running jessie). I have also switched to Oracle Java from OpenJDK.

As for the problem itself, could it be some sort of locking issue with a resource that’s slow/unavailable? It is curious that login and messaging stopped working but the admin panel remains functional, save for the session list.

Additional detail: it would appear that a connection attempt stalls for a long time, roughly 16 minutes going by these timestamps:

07:54:53 - Connecting as nils.meyer_ext@sorry.had.to.censor

08:11:26 - nils.meyer_ext@sorry.had.to.censor logged in successfully, online (priority 0).

I have also enabled slow query logging for MySQL to see if there are stalls.

Still no joy; with 3.10 even more sessions broke. It seems that the TCP connection stays open, but the server is somehow stalled, since it does not respond to ping requests or any other XMPP traffic.

What is interesting to see in the thread dump is that there are a lot of threads busy / blocked while doing some kind of LDAP operation. I’m also seeing a lot of roster/group actions, which I assume also interact with LDAP. This is where I’d start investigating. Can you check the logs of your LDAP server and see if anything abnormal sticks out around the times of your outages?
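To put rough numbers on that impression, a small script can tally the thread states in a jstack dump and count how many of those threads have LDAP frames in their stack. This is only a sketch: it assumes the usual jstack text layout (thread entries separated by blank lines, each with a `java.lang.Thread.State:` line), and `summarize_dump` is a made-up helper name, not an Openfire or JDK tool.

```python
import re
from collections import Counter

STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")

def summarize_dump(text, needle="ldap"):
    """Return (all_states, matching): thread-state counts for every thread,
    and for the threads whose entry mentions `needle` (case-insensitive)."""
    all_states = Counter()
    matching = Counter()
    # jstack separates thread entries with blank lines.
    for block in re.split(r"\n\s*\n", text):
        if not block.lstrip().startswith('"'):
            continue  # not a thread entry (dump header, JNI summary, ...)
        m = STATE_RE.search(block)
        state = m.group(1) if m else "UNKNOWN"
        all_states[state] += 1
        if needle.lower() in block.lower():
            matching[state] += 1
    return all_states, matching
```

Usage would be something like `summarize_dump(open("jstack.out").read())`, where the file name is whatever the dump was saved as. If most BLOCKED or RUNNABLE threads show up in the second counter, that supports the idea that LDAP calls are what everything is waiting on.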

It appears there were a few issues with the connection to the LDAP server. Using an in-house mirror fixed the problem for now.
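In case it helps someone comparing an upstream LDAP server against a local mirror, here is a rough sketch for timing the TCP connect from the Openfire host. It assumes plain LDAP on port 389 (LDAPS would normally be 636), and `time_connect` is a made-up helper name. It only measures the connection setup, not bind or search latency.

```python
import socket
import time

def time_connect(host, port=389, timeout=5.0):
    """Return the seconds taken to open a TCP connection to host:port,
    or None if the connection fails or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections, unreachable hosts, timeouts
        return None
```

Calling this in a loop against both servers and logging the results should show whether connects to one of them are intermittently slow or failing.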