100% CPU Usage

So we have been seeing this issue as well with Openfire 3.10.

VM:

1 core

1 GB RAM

30 GB HDD

Users: 15

Clients: Jitsi (Windows), iMessage (OS X), Trillian (iOS / Android)

OS: Ubuntu x64 (versions below)

Tested with Ubuntu 12.04 and 14.04, same result.

Tested with Java 7 and 8, same result

Added “-Djava.net.preferIPv4Stack=true”

Helped, but the same issue happened again after a longer period of time.

Added “-Xms256m -Xmx768m -XX:+UseG1GC”

No issues for 48 hours

This may not work for everyone; we have used some pretty restrictive memory settings, but we also have a very small system. It’s worked for us, and until there is some kind of actual fix to the package we will leave it like this. Just thought I’d post in case anyone else wants to test this.
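
In case it saves someone some digging: these flags get passed to the Openfire JVM through whatever file your init script sources. A rough sketch follows; the exact paths and variable name are assumptions, so verify them against your own init script.

# RPM-based installs read /etc/sysconfig/openfire:
OPENFIRE_OPTS="-Djava.net.preferIPv4Stack=true -Xms256m -Xmx768m -XX:+UseG1GC"

# Debian/Ubuntu packages source /etc/default/openfire instead;
# check that file for the variable the init script actually uses.

# restart so the JVM picks up the new options
sudo service openfire restart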

Can you keep us in the loop as to whether you experience any problems with users logging in now that you have added “-Xms256m -Xmx768m -XX:+UseG1GC”?

Where (what file) should I add the “-Xms256m -Xmx768m -XX:+UseG1GC” parameter?

Follow-up: never mind, found it. I tried this parameter and it didn’t work. In fact, things got worse: Messages/iChat on my Mac (OS X Yosemite) made the fan start going nuts.

Just upgraded, started seeing this problem as soon as people started using it.

25126 daemon   20   0  668404 122184  10048 S  99.9  6.5  89:08.71 /opt/openfire/jre/bin/java -server -DopenfireHome=/opt/openfire -Dopenfire.lib.dir=/opt/openfire/lib -cl+

I’ve looked through the logs; nothing useful.

No fancy plugins, just AD and MySQL setup. I only upgraded because I kept getting reminders from the server to upgrade.

I have since downgraded via yum downgrade

I may have a fix (fingers crossed). I have only tested this with an OS X server; I have not tested it on any other platform.

Openfire 3.10.0

OS X Server

Java Version: 1.8.0_45 Oracle Corporation – Java HotSpot™ 64-Bit Server VM

Plugins installed: Broadcast, Just Married, Monitoring Service, Search, Subscription, User Import Export

**System ran for 5 days with no crashing. Just migrated users over yesterday afternoon. Showing no issues at this time.**

I modified the following file: openfire-launchd-wrapper.sh

Original reads:

/usr/bin/java -server -jar "$OPENFIRE_HOME/lib/startup.jar" -Dopenfire.lib.dir=/usr/local/openfire/lib&

Change to:

/usr/bin/java -server -jar "$OPENFIRE_HOME/lib/startup.jar" -Dopenfire.lib.dir=/usr/local/openfire/lib -Djava.net.preferIPv4Stack=true -Xms256m -Xmx768m -XX:+UseG1GC&

Thought I had this issue fixed. The server was up 13 days, but then the CPU pegged and all clients got booted; the management web interface was still functional.

nohup.out attached (generated with ‘kill -3’ after CPU pegged).
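
For anyone who wants to capture the same data next time it happens, this is roughly the sequence (PID placeholder to fill in; jstack needs a full JDK rather than the bundled JRE):

# find the Openfire java process and its hottest threads
pgrep -f openfire
top -H -p <openfire_pid>

# dump all thread stacks to the JVM's stdout (lands in nohup.out when started via nohup)
kill -3 <openfire_pid>

# alternative if a full JDK is installed
jstack -l <openfire_pid> > openfire-threads.txt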

CentOS 6.6 VM on ESXi, 2GB RAM, 1 CPU.

OF 3.10.0, embedded database, AD auth, default JVM (1.7.0_76 Oracle). OPENFIRE_OPTS="-Djava.net.preferIPv4Stack=true -Xms256m -Xmx1024m -XX:+UseG1GC"

We currently have two instances of Openfire running on two separate servers: one for our internal use and one for our external clients. We have about 60 users on our internal server and 25 on the client server. Since the upgrade to 3.10.0 we have had to restart the Openfire service three times on the internal server and twice on the external one. They both have the same specs as well. This time the external client server has been up for two weeks without a restart, while I have to restart the internal Openfire service tonight after only seven days up, with java’s %CPU hovering around 198-200% today. %MEM for Openfire is at 35.2% on the internal server and 6.1% on the client server. I have Java on both machines set to OPENFIRE_OPTS="-Xms768m -Xmx3096m", so memory has never been an issue for me.

When I upgraded from 3.9.3 to 3.10 the CPU % suddenly started maxing out over time, which seems to be the same as everyone else is seeing. Both machines are currently running centos-release-6.5.el6.centos.11.2.x86_64, and both are fully updated and patched.

I finally saw this issue today.

Running on Windows 2012 R2 / MS SQL / LDAP

So I restarted the Openfire service on both machines last night, and already today, less than a day later, our internal Openfire server has java running at 200% CPU, while the other server is still perfectly fine. It’s getting very frustrating that this keeps happening.

Can confirm I am seeing this as well: the process jumps to 100% CPU and kicks all users from the server. It’s happened three times in quick succession today.

Does anyone know if you can safely downgrade to 3.9.3, which seems to be a more stable version according to some posts I’ve seen? This is a new install as of 3.10.0.

I did a yum downgrade successfully myself. The issue was too big.
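
For anyone asking about the downgrade itself, it was roughly the sequence below, assuming the 3.9.3 RPM is still in your yum cache or repo; back up conf/openfire.xml and the embedded-db directory first, just in case.

service openfire stop
yum downgrade openfire
service openfire start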

We saw this issue as well within 24 hours of upgrading. The openfire server was using 100% of a CPU core, and it stopped allowing new connections. It also generated a traceback whenever I attempted to load the sessions page in the web console (though the console otherwise worked).

Server config is:

OS: CentOS 5.11

Openfire: 3.10.0 (from RPM)

JRE: Oracle Java 1.8.0_45.

The previous 3.9.3 was using 1.8.0_45 as well without issues.

Plugins:

Broadcast 1.9.0

Content Filter 1.7.0

Monitoring Service 1.4.2

Packet Filter 3.2.0

Registration 1.6.0

Search 1.6.0

User Import Export 2.4.0

I did not do any further diagnostics and immediately reverted to 3.9.3, at which point the issue was resolved. (I realize that’s not especially helpful, but there are definitely plenty of other people sticking with 3.10.0 who can reproduce this issue; I just want to provide my configuration in case it helps.)

Yeah, I did the downgrade, and so far everything is stable and has been working.

Just wanting to add a +1

We’re hitting the same issue. Our 3.9.3 server was running on an old squeeze KVM with 1 core and 2 GB of RAM and never missed a beat. We built a new KVM on Ubuntu with 2 cores, 4 GB of RAM and OpenJDK 1.7.0_79; it appeared to work fine when we switched over one evening, but by the next morning it had stopped responding. We have now tried all the suggestions in this thread (garbage collection, memory allocation and disabling IPv6) and the problem persists on 3.10.0, even with only 15 users. We will wait for a point release to try again and have rolled back to our older 3.9.3 server for now.

Happy to provide any further info if anyone needs it for troubleshooting purposes.

Finally wanting to add my +1: I’m also seeing this on one of my two Jabber servers (the “active” one in an active-passive pair). That said, this server isn’t under any real load; thus far it’s only been used by me for testing. It’s running on a Dell R320 with 32 GB RAM and a quad-core Xeon processor. The operating system is RHEL 6.5 using OpenJDK 1.7.0_79.

I’ve currently got the following JVM options set (most of these are for garbage collection):

OPENFIRE_OPTS="-Xmx16G -Xms16G -XX:NewRatio=1 -XX:SurvivorRatio=4 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:CMSFullGCsBeforeCompaction=1 -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintPromotionFailure"

I’m considering trying the flag to disable IPv6, but from reading the rest of the thread it sounds like it’s possibly a red herring.

In terms of plugins I’ve got:

Broadcast

Hazelcast

DB Access

MUC Service

Search

I’m using LDAP for users.

Hey guys,

So we ran into this issue as well while trying to run some load tests on one of our middleware components that connects users to Openfire. With 3.10 it basically failed to log in users after about 5 or 6, and at that point Openfire spins. We can still log in to the admin console, but you can no longer get the list of user sessions; when we click on the Sessions tab, it just spins.

Running the same tests on 3.9.3 works fine.

Now since some comments seemed to indicate an issue with the Apache MINA libraries (Openfire 3.10.0 Beta - High CPU usage, https://issues.apache.org/jira/browse/DIRMINA-1011), I went ahead and removed the org/apache/mina folder from the openfire.jar and put the 5 Apache MINA jars into the lib folder directly. I first tried 2.0.9 and saw the same issue, then I tried 2.0.8 and still saw the same issue. Then I tried 2.0.7 and boom, now it’s working properly. Now we’re able to run the load test and get well past a hundred users.

So I’d say this is definitely related to some bugs in the 2.0.8 and 2.0.9 versions of the Apache MINA core library.

EDIT: I created and attached a zip file which contains the modified openfire.jar and the 5 Apache MINA 2.0.7 jars, in case anyone else wants to try this out. All I did to openfire.jar was unzip it, remove the ‘org/apache/mina’ folder, zip it back up, rename it to .jar and copy it back to the Openfire lib folder. Then I copied the 5 MINA jars there and restarted the server.

If you do this, make sure you either rename the original openfire.jar to something that doesn’t end in .jar (e.g. openfire.jar.original) or move it to a different folder.
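
If it helps, the whole repack is only a handful of commands; a sketch assuming a standard /opt/openfire install (adjust paths as needed, and download the five MINA 2.0.7 jars separately):

# keep a copy of the original that no longer ends in .jar
cd /opt/openfire/lib
cp openfire.jar openfire.jar.original

# unpack, drop the bundled MINA classes, repack
mkdir /tmp/openfire-jar
cd /tmp/openfire-jar
unzip -q /opt/openfire/lib/openfire.jar
rm -rf org/apache/mina
zip -qr /tmp/openfire-patched.jar .

# swap the patched jar in, add the standalone MINA 2.0.7 jars, restart
mv /tmp/openfire-patched.jar /opt/openfire/lib/openfire.jar
cp /path/to/mina-2.0.7-jars/*.jar /opt/openfire/lib/
service openfire restart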

Andi,

Thanks for the sleuthing. I’m running your build right now and will report back with my findings. Based on the recent update in the OF-883 issue, it seems very likely that this is caused by Mina or OF’s integration of Mina.

Tom

I wonder if this is related

[DIRMINA-995] Deadlock when using SSL and proxy - ASF JIRA

If so, it’s confusing, because that bug is marked fixed, and the comments within indicate they would be releasing MINA 2.0.10 right away.

Looking at git, they were about to release 2.0.10 on December 22nd, then changed their minds and now it’s been 6 months with no changes to the 2.0.x branch:

ASF Git Repos - mina.git/shortlog

It might be an interesting test case to try building the almost-2.0.10 code and see if that addresses the issue similarly to Andi’s trials.
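
If anyone wants to try that, building it locally should only take a few minutes; a sketch, assuming the GitHub mirror and the 2.0 branch name (MINA builds with Maven):

git clone https://github.com/apache/mina.git
cd mina
git checkout 2.0        # the 2.0.x maintenance branch
mvn -DskipTests clean install
# the resulting mina-core snapshot jar ends up under mina-core/target/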

Edit: While I was bouncing around the MINA mailing lists I noticed this:

[DIRMINA-1001] mina2.0.9 session.close cpu100% - ASF JIRA

This has a comment that suggests that closing sessions needs to be handled differently starting with 2.0.9. So perhaps this is related?

With regards to DIRMINA-1001, it’s possible that it’s related, but I doubt it, because I saw the same issue when using 2.0.8 and it seems that new logic was only added in 2.0.9. I might try out a local build of the latest 2.0.10 code and see if that works.