We are currently running Openfire 3.7.1 and are experiencing random CPU spikes so severe that clients are disconnected and cannot reconnect; they receive an error saying they cannot authenticate. It gets to the point where we are forced to kill the Openfire service and restart it because it completely hangs the server. The timing is very random: sometimes we can run for several days, other times it bombs after a day or two. We ran this server for almost three years without incident. The problem started in November and I honestly don't know what changed. I believe we even upgraded to 3.7.1 in the hope that it would help with the problem.
Just last night I moved our users onto a new server. The difference this time is that I chose to keep the database local; the original used a remote MSSQL database. I set up the new server to match the old one as closely as possible and then just changed DNS. I do not see anything in the logs that indicates a problem. The only deviations from a default installation are:
Kraken IM Gateway version 1.1.2 (MSN Gateway activated)
Monitoring Service version 1.2.0 (for archiving)
I have already seen high CPU usage that pegged the CPU for over a minute, and this concerns me. In that instance it looked like someone was sending a file transfer or something similar (I noticed above-average network usage), but at other times the CPU spikes with no abnormal network usage.
I am trying to determine where to begin troubleshooting these issues. We have about 200 users, so it's not a large environment by any means. As mentioned, Openfire ran 100% stable until around November, when we started experiencing this problem.
So my questions are: Is it possible that the archiving feature is contributing? Are there any known buffer overflow vulnerabilities that a client could exploit against the server? Are there any known issues with users and their custom avatars? Any other ideas?
I went from 2 GB of memory and 1 vCPU (which ran stable for years) up to 2 vCPUs and 4 GB to rule out hardware issues. Openfire has become critical to our organization and I need to hash out these stability issues.
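For reference, one common starting point for this kind of Java CPU spike is to match the busiest OS thread against a Java thread dump taken during the spike. The sketch below is only an illustration under several assumptions: a Linux host where `top -H` shows per-thread IDs, a thread dump captured with `jstack`, and a made-up two-thread sample dump. The helper names `tid_to_nid` and `find_thread` are hypothetical and not part of Openfire.

```python
import re

def tid_to_nid(tid: int) -> str:
    """Convert a decimal OS thread ID (as shown by `top -H`) to the
    hexadecimal nid form that appears in jstack thread headers."""
    return hex(tid)

def find_thread(dump: str, tid: int):
    """Return the header line of the thread whose nid matches tid,
    or None if no thread in the dump matches."""
    pattern = re.compile(r"\bnid=" + re.escape(tid_to_nid(tid)) + r"\b")
    for line in dump.splitlines():
        # Thread header lines in a jstack dump start with the quoted thread name.
        if line.startswith('"') and pattern.search(line):
            return line
    return None

# Made-up sample for illustration only; a real dump would come from jstack.
SAMPLE_DUMP = '''"pool-1-thread-3" prio=10 tid=0x00007f1c nid=0x3039 runnable [0x00007f1b]
   java.lang.Thread.State: RUNNABLE
"main" prio=10 tid=0x00007f1a nid=0x1f90 waiting on condition [0x00007f19]
'''

if __name__ == "__main__":
    # Suppose top -H reported thread 12345 as the one consuming CPU.
    print(find_thread(SAMPLE_DUMP, 12345))
```

If the matching thread's stack trace points into a particular plugin (for example archiving or a gateway), that would narrow down where to look next.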