New Openfire 3.6.3 server using 90+ percent of CPU

John_May1 · May 5, 2009, 6:53pm

We have been using Wildfire/Openfire for several years with Exodus on the clients. We recently out grew our Openfire 3.5.2 server running on Windows 2003 on a MS Virtual Server. We currently have about 350 concurrent users. I would like to get that up to 2100.

I decided to install Openfire 3.6.3 on a HP Proliant DL380-G3 Dual XEON 2.8Ghz 2 GB RAM still running Windows Server 2003 SP2. We are using Windows Active directory for authentication. Last week we tried to have everyone start using the new server. I gave it the same Openfire server name as the server it was replacing (the Windows name is different we use a CNAME) and overnight I changed the CNAME and everbody (about 350 users) was pointed at the new server. Within about 90 minute the CPU went to 100% and then the server quit accepting new logons and I could no longer logon to the Admin console. I switched the CNAME back to the old server.

I started over with the new server. I re-installed Openfire on the new server and gave it a unique Openfire server name and this morning started moving people in small groups to the new server. We had them delete the old Exodus profile and created a new one for the new server. When we got to about 50 users the CPU was at or near 100% (90+% was the openfire-service.exe). We stopped migrating users for now. I restarted the service at noon and we are up to 43 users and the CPU is holding at an average of 56%. Java memory is at 614MB of 986 MB. Before I restarted the service I made two changes. I took the ![CDATA[…]] off of the ldap.vcard-mapping setting since I saw it was not need and changed the Java memory from 1024-1276 to 768-1024. I’m not sure if that helped.

Other setting I change are as follows:

cache.userCache.size=2097152

cache.username2roster.size=5242880

cache.vcardCache.size=1048576

What can I tune to get this server up to its full potental. By the way Java version 1.6.0_03 (came with the Openfire install.).

John_May1 · May 6, 2009, 6:23pm

This morning the users started logging on and about 50 were on and the CPU was at 40-50%. We added another 50+ users and as soon as the 1st few logged on the CPU went to 99%. The CPU stays a 99% but the users are not having any problems and the Openfire Admin Console seems very responsive. I logged on to the servers Windows console and it seem OK.

I am baffled.

Any help?

LG1 · May 7, 2009, 7:18pm

Hi John,

it may help if you could provide a javacore. See for JVM Settings and Debugging details, enabling a GC log and looking for a lot of GC’s or long compilation times could also help.

LG

John_May1 · May 7, 2009, 8:24pm

I have not used Java based server apps much in the past so I am not familiar with the debug tools. From the instruction you referenced I downloaded StackTrace and ran a trace on the PID for openfired.exe. It is attached. I have no clue on how to “read” it.
stacktrace1.txt (62061 Bytes)

LG1 · May 8, 2009, 4:34pm

Hi John,

please also run “pslist.exe -dmx Openfire >c:\temp\openfire.threads.txt” two times with a 1 minute delay between the runs as described in the document. This allows to identify which threads use so much CPU.

LG

John_May1 · May 11, 2009, 2:47pm

Attached is the pslist output. I will also try to attach the GC trace collected on Friday.

Thanks!
openfire.threads.txt (11765 Bytes)
gclog.txt (8186856 Bytes)

LG1 · May 11, 2009, 7:08pm

Hi John,

did you restart Openfire between taking the javacore file and the pslist? It seems that one can not match the pslist-Tids to the javacore-nids. So if you could provide a pair of javacore and pslist outputs one should be able to identify the problem.

In both pslists one can may note this line:

2060 13 3593940 Running 0:25:01.703 0:19:13.953 3:13:17.066

2060 13 3605904 Running 0:25:21.203 0:19:26.015 3:14:17.628

2060 => 80c but I can’t find nid=0x80c in the javacore.

So this thread is running both times and it is consuming CPU.

LG

John_May1 · May 12, 2009, 2:45pm

Sorry about that. I am a little slow on the uptake.

My trial period for Stacktrace ran out. Will the jstack.exe output work as well? I will switch from running as a Windows service to starting it with the launcher tonight since it appears the jstack command line utility won’t work with a Windows service.

I was wondering, would upgrading to 3.6.4, now that it is out, be a good idea?

LG1 · May 13, 2009, 4:19am

Hi John,

upgrading to 3.6.4 is recommended as it fixes a critical bug (“Prevent users from changing other users passwords”). Anyhow it will likely not fix this problem.

LG

PS: One day I’ll post a javaagent which allows to take a javacore together with thread timing informations.

John_May1 · May 13, 2009, 2:19pm

Again thanks for the help.

Here are the outputs from pslist (2 x 1 minute apart) and jstack.
jstack_openfired_pid_2332.txt (47721 Bytes)
pslist_openfired.txt (11923 Bytes)

LG1 · May 15, 2009, 8:01pm

Hi John,

The thread which is looping is a MINA thread, so I wonder how much Openfire code is involved. I wonder whether this is a MINA bug or whether http://jira.codehaus.org/browse/JETTY-380 could be related. So I have no idea how to reproduce or to fix this issue.

Are you using BOSH / web based clients?

LG

PS: I did take a look at these lines in your dumps:

Tid Pri Cswtch State User Time Kernel Time Elapsed Time

3680 13 5291659 Ready 0:25:17.625 0:18:53.500 2:52:32.483

3680 13 5309113 Running 0:25:37.609 0:19:07.296 2:53:32.607

1904 13 4924144 Ready 0:34:30.093 0:23:02.062 2:48:26.488

1904 13 4938673 Ready 0:34:55.718 0:23:15.671 2:49:26.612

Elapsed Time: 0:01:00 - Kernel Time: 0:00:27 - User Time: 00:00:45 - User+Kernel Time: 00:01:12

With 4 cores this means that these threads cause 28% CPU usage. I did ignore the other two SocketAcceptorIoProcessor threads which do cause additional CPU usage.

“SocketAcceptorIoProcessor-0.0” prio=6 tid=0x49083c00 nid=0xe60 runnable [0x4aadf000…0x4aadfc98] and

“SocketAcceptorIoProcessor-0.1” prio=6 tid=0x48e93400 nid=0x770 runnable [0x498af000…0x498afa18]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source)
at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source)
at sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
- locked <0x08484fe0> (a sun.nio.ch.Util$1)
- locked <0x08484fd0> (a java.util.Collections$UnmodifiableSet)
- locked <0x083d7f80> (a sun.nio.ch.WindowsSelectorImpl)
at sun.nio.ch.SelectorImpl.select(Unknown Source)
at org.apache.mina.transport.socket.nio.SocketIoProcessor$Worker.run(SocketIoProce ssor.java:480)
at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

John_May1 · May 19, 2009, 1:38pm

We don’t use a web based client. All the users are on Exodus 0.9.1.0 or Exodus 0.10.0.0.

We peak out at about 350 users midmorning. The CPU hits 100% when about 180 users are connected and there are about 32 to 40 conversations active. I loaded the monitoring.jar last night and have not seen any problems, the system seems to be responding well to the Adminisration console even with the CPU at 100%. When I RDP to the server the system is surpisingly responsive. Usually when there is high CPU the server console and the Openfire Admin Console are painfully slow, both are doing well and the users are not compaining of being dropped all the time like the old server.

BTW I upgraded to 3.6.4 and like you said no change.

John_May1 · June 16, 2009, 5:44pm

I think I found a contributing factor. Our users were tryin to add people to their rosters and they would guess the JID or they would add people that never or rarely log in to Openfire. With all these “pending” roster requests it appears the server was overloaded. I filtered the openfire.script and found the problem entries and using the admin console I deleted them from the users’ rosters. After deleting a few hundred of these, the CPU utilization has been reduced, it is still high, but much better.

I am recommending some basic user training. This is complictated by the fact that we have a mix of Exodus v0.9 and v0.10 clients. We will be upgrading everyone to v0.10 and we will have them configure Exodus to display “pending” roster entries, with instructions to delete them after a few days. We will also tell people to use search to find people and not just guess.

akrherz · June 16, 2009, 9:21pm

John,

Were they adding valid Jabber IDs, or just the bare username?

daryl

John_May1 · June 17, 2009, 5:07pm

We had a mixture of ID’s. Our JID is 1st initial, middle initial and 6 character of the last name. Bill T. Smith would be WTSmith@im.ourdomain.net or Fred Johnson (no middle inital) would be F_Johnso@im.ourdomain.net. We are getting things like Fred or Fred%20Johnson@im.ourdomain.net. Sometime they would put the persons SMTP email address, that would be WilliamSmith@ourdomain.com. The bottom line is they are not correct and would stay “pending” forever.

akrherz · June 17, 2009, 5:12pm

My guess is that openfire is attempting to make server 2 server connections for each of those domains and that is creating artificial load. I am probably off base though. There are open Jira tickets on openfire not handling invalid JIDs correctly

daryl

LG1 · June 17, 2009, 6:54pm

Hi Daryl,

I did try this with a domain “a.b.c” no success. The s2s connection did simply fail. It did log a lot of “OutgoingSessionPromise: Error sending packet to remote server” messages.

One interesting thing is that removing “foo@b.c” from the roster prints in the debug log “#n registered a statement as closed which wasn’t known to be open. This could happen if you close a statement twice.”.

LG