Strange Openfire outages/lags - for a few minutes about once every 10 days?


We are running a relatively big Openfire instance. There are more than 400,000 registered users; at peak, up to 10,000 concurrent users are connected and the packet rate is up to 15,000 packets/min. Currently I am using one Openfire server and one or two Connection Managers (depending on load, although this doesn’t seem to have much impact on performance).

I have run this instance for more than a year and a half without major problems; the current uptime is 280 days. The Openfire version is 3.7.1. But about 2 months ago, strange things started to happen. Every 10 days or so (sometimes 7 days, sometimes 3 weeks), customers write to me that they cannot connect to the server. The problem lasts for 15 to 30 minutes, then everything works normally again. I didn’t find any performance problems: server load was around 10%. Also, according to our monitoring, web services on the same server kept working perfectly, so it was not a network-related problem.

When I looked at the logs, I found a huge number of “packet could not be delivered” errors:

org.jivesoftware.openfire.IQRouter - Error or result packet could not be delivered startMonitoring

These errors appear in my logs every day. If I guess correctly, this can happen when a remote client drops its connection without properly disconnecting from the Openfire server. But in the cases when Openfire starts to lag or time out for users, there are a lot of these errors, around 100 in one minute.
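To compare normal days with the outage windows, the error rate can be counted per minute straight from the log file. The following is only a rough sketch: it assumes each log line starts with a timestamp in the form "yyyy.MM.dd HH:mm:ss" (adjust the pattern if your log layout differs), and the function names and the threshold of 100 are made up for illustration.

```python
import re
import sys
from collections import Counter

# ASSUMPTION: log lines begin with "yyyy.MM.dd HH:mm:ss", for example
# "2013.06.12 10:15:32 org.jivesoftware.openfire.IQRouter - Error or result ..."
# The capture group keeps everything up to the minute.
TIMESTAMP = re.compile(r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}):\d{2}")
NEEDLE = "packet could not be delivered"


def errors_per_minute(lines):
    """Count undeliverable-packet errors, keyed by 'yyyy.MM.dd HH:mm'."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:  # lines without a leading timestamp are skipped
            counts[match.group(1)] += 1
    return counts


def busy_minutes(counts, threshold=100):
    """Return (minute, count) pairs at or above the threshold, sorted by minute."""
    return sorted((minute, n) for minute, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Usage: python count_errors.py <path-to-log-file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for minute, n in busy_minutes(errors_per_minute(log)):
            print(minute, n)
```

Running this against the logs from an outage day and from a quiet day should show whether the bursts begin before users start reporting connection problems or only after.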

I am trying to figure out what is wrong. I read the “Openfire Achilles’ heel” document at http://community.igniterealtime.org/docs/DOC-1925 and maybe this is my problem. But that document offers no solution, so I don’t know what to do about it. I have already removed almost all plugins; only Monitoring, User Service, and Packet Filter are enabled.

So I want to ask: will an upgrade to 3.8.2 help with this kind of problem? Because of the huge user base, I want to avoid the outage that an upgrade would cause, so I’d like to know whether the upgrade is worth it. Also, are there any “secret” properties that could help me, for example increasing the size of a thread pool?