Dramatic CPU Spiking

Hi Gang,

We are currently running version 3.7.1 and are experiencing random CPU spikes so severe that clients become disconnected and cannot reconnect. They receive an error that they cannot authenticate. It gets to the point that we are forced to kill the Openfire service and restart it because it completely hangs the server. This is very random. Sometimes we can run for several days; other times it bombs after a day or two. We have been running this server for almost three years without incident. This started in November and I honestly don't know what changed. I believe we even upgraded to 3.7.1 in hopes that it would help with the problem.

Just last night I rolled our users onto a new server. The difference this time is that I chose to keep the database local; the original was using a remote MSSQL database. I set up the new server as close to the old one as possible and then just changed DNS. I do not see anything in the logs that indicates a problem. The only exceptions we have from default are:

Kraken IM Gateway version 1.1.2 (MSN Gateway activated)
Monitoring Service version 1.2.0 (for archiving)

I have already seen high CPU usage, with the CPU spiked for over a minute, and this concerns me. It looked like someone sending a file transfer or something (because I noticed above-average network usage), but other times when the CPU spikes there is no abnormal network usage.

I am trying to determine where I can begin troubleshooting these issues. We have about 200 users, so it's not a large environment by any means. As mentioned, Openfire ran 100% stable up until around November, when we started experiencing this problem.

So my question is, is it possible the archiving feature would be contributing? Are there any known buffer overflow vulnerabilities that a client could run against the server? Any issues with users and their custom avatars? Any other ideas?

I went from 2 GB of memory and 1 vCPU (which ran stable for years) up to 2 vCPUs and 4 GB to rule out hardware issues. Openfire has become critical to our organization and I need to hash out these stability issues.

Thoughts?

I would try disabling the plugins for some time (if you can). You said you went to local database. But is it still MSSQL?

So far I have disabled the archiving plug-in (even though we are required to archive our conversations). Too early to tell obviously but that seems to have reduced the load. As for the database, I chose to go with the included integrated database. I do not suspect we are a large enough user base to exceed the capabilities. Wanted to try and rule out as much as possible.

I had come across a few previous posts complaining of similar issues, but they seemed to have more to do with the PEP plugin. Apparently this problem did not exist with version 3.6.x.

Anyhow, running with archiving disabled currently so we’ll see.

This has happened again. For the past twenty minutes the CPU has been spiked at 100%. The archiving feature was turned off before this occurred and is still off.

Is there any log I can send in that may indicate where the problem lies?

Enable the JVM option to log GC details (-XX:+PrintGCDetails). During (Full) GC runs, usually all CPUs are used and the application does not respond. If you have a memory issue, you will see a lot of long-running Full GCs. See also http://community.igniterealtime.org/docs/DOC-1033
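Once GC logging is on, a small script can help spot the long-running Full GCs. The following is only a rough sketch in Python, not an official tool: it assumes log lines shaped like the ones older HotSpot JVMs print with -XX:+PrintGCDetails (a "Full GC" marker and a trailing "N secs]" figure), and the function name and the 1-second threshold are made up for illustration. Real logs may be formatted differently depending on the JVM version and collector.

```python
import re

# Assumption: each GC event is one line and contains figures like "2.5000000 secs]".
# This matches the PrintGCDetails output of older HotSpot JVMs; other JVMs may differ.
SECS_RE = re.compile(r"(\d+\.\d+) secs\]")


def summarize_full_gcs(lines, slow_threshold_secs=1.0):
    """Count Full GC events and collect their pause times (hypothetical helper)."""
    pauses = []
    for line in lines:
        if "Full GC" not in line:
            continue  # skip minor collections and unrelated log output
        figures = SECS_RE.findall(line)
        if figures:
            # Take the last "N secs]" on the line: either the overall pause or,
            # when a [Times: ...] section is present, the wall-clock "real=" value.
            pauses.append(float(figures[-1]))
    return {
        "full_gc_count": len(pauses),
        "total_pause_secs": sum(pauses),
        "slow_pauses": [p for p in pauses if p >= slow_threshold_secs],
    }


# Possible usage, with a hypothetical log file name:
# with open("gc.log", encoding="utf-8", errors="replace") as f:
#     print(summarize_full_gcs(f))
```

Many entries in slow_pauses clustered around the time of a CPU spike would point toward a memory problem rather than a network or plugin issue.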

Using the integrated database with archiving is not a good idea. The DB will grow, and since this DB is always held in memory, memory consumption will grow with it. GC will then take longer, and this can probably cause CPU spikes. (GC is garbage collection: the process that checks memory and frees whatever is no longer needed.)

How would I go about setting this on a Windows server? I cannot find definitive instructions for implementing this on Windows.

Thanks, wroot. I do realize this, which is why, as I stated above, I have removed the archiving. Prior to this last install we WERE using a remote MSSQL database with archiving enabled and this was still a problem. This problem did not appear until we upgraded to 3.7.x.

The intention with this clean install was to rule out as many variables as possible. If/when I am confident everything is working properly/stable again I intend on moving back to the remote DB. Just want to get things running smooth first.

We are experiencing the exact same issue described by a_user. I am running Openfire 3.7.0 on a Windows 2008 R2 VM server with 4GB of memory. We only have about 60 users on this system on a daily basis so the hardware should be more than sufficient. We are also using the local built-in DB option rather than a remote MSSQL box.

I have 2 similar, but I believe separate issues.

  1. Java Memory Usage: I have updated the vmoptions file to increase the Java memory to 1.5 GB, which helps the memory last longer, but there still seems to be some type of memory leak because it is constantly consuming more and more memory. We definitely see the Java memory usage growing over the course of a week, and eventually it gets up to 100% and users are not able to log in. To prevent this, I put in a script that checks the memory usage each evening and restarts the openfire-server service if it reaches a certain threshold. This seems to help keep us from hitting 100% memory usage.

  2. 100% CPU: This is the same issue a_user is reporting. Randomly, the openfire-service service will consume 100% of the CPU and the server will become unresponsive. We can go up to a week without any issues, but usually it lasts only a few days before the issue recurs. The only options are to kill the service and restart it or to reboot the server. I only have the following plug-ins installed on this box: Monitoring Service, Search, User Import Export, and User Service. I do have archiving enabled as well, but based on a_user's findings, that does not seem to be the cause. On a typical day when the system is working correctly, the CPU is usually only between 1% and 10%, with an occasional spike to 30-40%, but that lasts only a couple of seconds before it goes right back to 1%. I have combed the Event logs and Jabber logs and have not been able to find anything that seems to correlate with the issue we are seeing. This is becoming really frustrating for both the users and myself.
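For anyone wanting to try the evening restart check described in item 1, the decision logic could look roughly like the Python sketch below. This is not the actual script from this post: the function names, the 85% threshold, and the way the memory figures are obtained are all assumptions, and the service name has to be whatever your own install registers. Only the standard Windows "net stop" / "net start" commands are taken as given.

```python
def should_restart(used_mb, max_mb, threshold=0.85):
    """Return True when heap usage has reached the given fraction of the maximum.

    Hypothetical helper: how used_mb and max_mb are measured is left to the caller.
    """
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    return used_mb / max_mb >= threshold


def restart_commands(service_name):
    """Build the Windows commands that stop and then start a service.

    The service name is an assumption; pass the name your install uses.
    """
    return [["net", "stop", service_name], ["net", "start", service_name]]


# Possible usage from a scheduled task (figures and service name are placeholders):
# import subprocess
# if should_restart(used_mb=1400, max_mb=1536):
#     for command in restart_commands("openfire-server"):
#         subprocess.run(command, check=True)
```

Keeping the threshold check separate from the restart commands makes it easy to test the logic without touching the running service.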

Any help you can provide would be greatly appreciated.

James

  1. Have you tried disabling the PEP service?

From the announcement above the forums:

Openfire up to and including version 3.6.4 (and looks like 3.7.0 too) suffers from a memory leak in its PEP component. If your Openfire server is crashing with OutOfMemoryExceptions, you might be having this problem.

As a workaround, you can disable PEP, by setting the Openfire property xmpp.pep.enabled to false.

James, do you have the ability to look at your database? I have a theory about the memory usage problems and I would like some help.

Can you find out how many rows you have in the ofPubsubItem table, and whether they have payload associated with them?

Thanks.

I just added the xmpp.pep.enabled = False setting this morning so we’ll see if that makes any difference with regards to the memory. Thanks.

I do have full access to this server and the embedded openfire.script database, but I do not know how to access the database file. From the documentation I could find, it looks like the database is HSQL? I don’t have that installed anywhere on this server so I’m not sure how to browse the contents of the file to look at the ofPubsubItem table. If you can provide any instructions on how to do this on a Windows system, I will gladly take a look for you.

Thanks.

First you have to shut down the server, since it is an embedded db.

In your Openfire install, there is a /bin/extra directory with a batch file called embedded-db-viewer.bat. You can run this and it will provide a simple UI to access the database.

I don’t know how much SQL you know, but type select count(*) from ofpubsubitem in the upper right pane and press **Execute SQL**. This will tell you how many rows you have.

You will have to make a small change first, though: I noticed that some of the jar files in the /lib directory are packed, and they will need to be unpacked before you can run the tool. This will require a JDK so you can use the unpack200 tool.

You can also make a copy of the file while openfire is running, open the copy and count the number of ofpubsubitem lines. How small/big are the embedded-db/openfire.* files?
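Counting the lines in the copy can be done with a few lines of Python. This is just a sketch under an assumption: that the embedded HSQLDB script file stores each row as one line beginning with "INSERT INTO OFPUBSUBITEM" (matched case-insensitively here). The function name and the copied file name are made up for illustration.

```python
def count_table_inserts(lines, table="ofpubsubitem"):
    """Count lines that look like INSERT statements for the given table.

    Assumes one "INSERT INTO <TABLE> VALUES(...)" line per stored row.
    """
    # The trailing space keeps similarly named tables from being counted.
    prefix = f"insert into {table} ".lower()
    return sum(1 for line in lines if line.lower().startswith(prefix))


# Possible usage on a copy of the file (hypothetical file name):
# with open("openfire-copy.script", encoding="utf-8", errors="replace") as f:
#     print(count_table_inserts(f))
```

Run it only against the copy, never against the live file that Openfire has open.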

It’s been 11 1/2 days since I did the “xmpp.pep.enabled = False” setting on my Jabber server and the Java Memory has remained stable as has the Openfire service. I have not done the DB query requested above yet only because the system has stayed running well to this point. If I have another instance where the service goes to 100% CPU and hangs there, I will check out the DB and supply the requested information.

Thanks for all the help.

James

Hi James,

Did you simply add “xmpp.pep.enabled = False” to the system properties page in the admin console and then restart the service? I am trying to understand specifically where to add this variable.

Yes, that’s exactly what I did.

You just scroll to the bottom and add “xmpp.pep.enabled” to the Property Name field and “false” to the Property Value field, then hit Save Property. Then restart the server.

Okay. This is what I did. Thanks guys.