Openfire scalability considerations and resources?

Hello,

We are getting very disappointing performance out of our OpenFire server, and we are looking for tools, resources and possibly people to help us diagnose and improve performance.

To give a bit of background, we are running OpenFire 3.6.3 on Debian, on a Dual Quad core IBM blade with 16GB of memory. It is the only app there, and we are having difficulty getting it past 7000 clients.

We think (hope really) that it has to do with the custom plugins we are running. When going past 7000 we start getting storms of disconnects, we can’t maintain a user count above that for any significant amount of time. The plugin [1] is registered as a Component and gets a significant amount of traffic, intercepting everything going to a custom domain.

We think that the processPacket call into that Component has to be the bottleneck. We are doing fairly little processing, but even that may be too much? If the packets processed by the multiple Openfire threads are being queued up into a single, non-threaded call to processPacket in our plugin, are we hitting a pretty basic scalability problem?
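For reference, here is a stripped-down sketch of how our Component is wired up (class and method names simplified, not our actual code); processPacket is the call we suspect:

import org.xmpp.component.Component;
import org.xmpp.component.ComponentManager;
import org.xmpp.packet.JID;
import org.xmpp.packet.Packet;

// Simplified sketch of our plugin's Component. Every packet addressed to the
// custom subdomain arrives through processPacket().
public class CustomDomainComponent implements Component {

    private ComponentManager componentManager;

    public void initialize(JID jid, ComponentManager componentManager) {
        this.componentManager = componentManager;
    }

    public String getName() {
        return "custom";
    }

    public String getDescription() {
        return "Handles traffic sent to our custom subdomain";
    }

    public void processPacket(Packet packet) {
        // Fairly little processing happens here, but it runs on the thread
        // that delivers the packet, so any delay blocks that thread.
        handle(packet);
    }

    private void handle(Packet packet) {
        // ... custom IQ handling, then componentManager.sendPacket(this, reply) ...
    }

    public void start() {
    }

    public void shutdown() {
    }
}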

We are starting to work on the problem on our own, but would appreciate help and insight you could offer. We may also be wrong about this and find out the problem is somewhere else entirely?

Best,

Timothee Besset

[1] http://www.igniterealtime.org/builds/openfire/docs/latest/documentation/plugin-dev-guide.html

No easy answers without doing some work. Try jhat to analyse the Java heap and see what objects are taking up all the resources, or

$ kill -QUIT <openfire-pid>   # see states of running java threads

FreeBSD 7.1(+) may be a better choice of OS.

Are you using Multi-User Chat? If so, you can adjust how the conversation log objects are flushed: Group Chat -> Group Chat Settings -> Other Settings, see the Flush interval (seconds) and Batch size settings, and the ‘Don’t Show History’ setting. Also see http://www.igniterealtime.org/community/thread/36560

If your plugin is buggy, you may need to profile it and see how to improve the code.

HTH,

BEA

Thanks for the pointers,

There is no significant memory consumption or CPU usage when OpenFire starts losing its marbles. Logging on, getting the contact roster, and sending custom IQs to the domain controlled by our plugin start to take longer and longer, until a decent proportion of users get disconnected and things stabilize again. The whole deathspin cycle lasts about 10-15 minutes.

While load testing, I did see connections dropping. I suggest the load stats plugin: “The statistic plugin prints usage information of the database connection pool, thread pool used for processing incoming traffic and the NIO networking layer.”

http://www.igniterealtime.org/community/thread/36635

Also, I wonder if this correlates with garbage collection cycles by chance (e.g. adding Java VM flags would display this)?
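If adding the GC flags means a restart you can’t schedule yet, here is a rough sketch (my own, nothing Openfire-specific) that you could run from a plugin on a timer to log the cumulative GC counters and line them up against the disconnect storms:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Logs cumulative GC counts and times for each collector. The same numbers
// end up in gc.log if you use -XX:+PrintGCDetails -Xloggc:... instead.
public class GcLogger {

    public static void logGcStats() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Route this to wherever your plugin logs; System.out is just for the sketch.
            System.out.println(gc.getName()
                + " collections=" + gc.getCollectionCount()
                + " totalTimeMs=" + gc.getCollectionTime());
        }
    }
}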

Hi Timothee,

Use “loadstats.jar” and take a look at JVM Settings and Debugging.

You may want to set

“-Djava.net.preferIPv4Stack=true” “-XX:MaxPermSize=128m” “-XX:+PrintGCDetails -Xloggc:/tmp/gc.log” as JVM parameters (restart required)

If your plugin is using one CPU because it is single threaded, then get a javacore with `jstack openfire-pid > /tmp/javacore.txt` and get the CPU used by the plugin thread with `ps -T -p openfire-pid -o pid,tid,pri,time | grep -v '00:00:00'`, as long as your plugin uses a thread to process data.

Enable an option to bypass your plugin, either within the admin console if your plugin has a web UI or via a system property, so you can try to get past 7000 users without it while it’s still installed.
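Something along these lines is what I mean by a system property bypass (the property name is just an example, pick your own):

import org.jivesoftware.util.JiveGlobals;
import org.xmpp.packet.Packet;

public class BypassExample {

    // Hypothetical property name; set it to true in the admin console
    // (Server -> System Properties) to short-circuit the plugin.
    private static final String BYPASS_PROPERTY = "plugin.custom.bypass";

    public void processPacket(Packet packet) {
        if (JiveGlobals.getBooleanProperty(BYPASS_PROPERTY, false)) {
            // Ignore the packet entirely while the bypass is enabled.
            return;
        }
        // ... normal handling ...
    }
}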

LG

We are seeing the top/main thread running really high or pegged at 100% CPU. OpenFire otherwise runs about 60 child threads, which are only a few % CPU each.

jstack would be a tremendous help here, but we can’t get it to produce a stack:

/usr/lib/jvm/java-1.5.0-sun/bin# ./jstack 7030
Attaching to process ID 7030, please wait…
sun.jvm.hotspot.debugger.NoSuchSymbolException: Could not find symbol “gHotSpotVMTypeEntryTypeNameOffset” in any of the known library names (libjvm.so, libjvm_g.so, gamma_g)
at sun.jvm.hotspot.HotSpotTypeDataBase.lookupInProcess(HotSpotTypeDataBase.java:400)
at sun.jvm.hotspot.HotSpotTypeDataBase.getLongValueFromProcess(HotSpotTypeDataBase.java:381)
at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:86)
at sun.jvm.hotspot.HotSpotTypeDataBase.<init>(HotSpotTypeDataBase.java:68)
at sun.jvm.hotspot.bugspot.BugSpotAgent.setupVM(BugSpotAgent.java:550)
at sun.jvm.hotspot.bugspot.BugSpotAgent.go(BugSpotAgent.java:476)
at sun.jvm.hotspot.bugspot.BugSpotAgent.attach(BugSpotAgent.java:314)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:146)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:58)
Debugger attached successfully.
jstack requires a java VM process/core!

On a related note, we installed the stats plugin, and whenever the server goes into a deathspin because we allowed too much load on it, we’re seeing the Queue Tasks jump.

“jstack would be a tremendous help here, …” ==> On Linux, `kill -3 openfire-pid` will cause the JVM to write one to STDOUT; one will likely find it in nohup.out, stdout.log or somewhere else, depending on your start script.

You get the same information, but it’s a little bit harder to locate it and to identify the start and end, especially if you create more than one stacktrace within a few seconds.
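If you can’t get at stdout at all, a small helper like this (just a sketch, triggered however you like from your plugin) writes the same kind of dump to a file you control:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

// Fallback thread dump: writes every thread's name, state and stack to a
// file of your choosing, for when jstack refuses to attach and stdout from
// the start script is not captured.
public class ThreadDumper {

    public static void dumpTo(String path) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(path, true));
        try {
            out.println("=== thread dump " + new java.util.Date() + " ===");
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                Thread thread = entry.getKey();
                out.println(thread.getName() + " (" + thread.getState() + ")");
                for (StackTraceElement frame : entry.getValue()) {
                    out.println("    at " + frame);
                }
                out.println();
            }
        } finally {
            out.close();
        }
    }
}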

LG

I am using the official debian package, which uses the start-stop-daemon script to spawn OpenFire at boot. I don’t think I have access to stdout anywhere. There is no stdout.log or nohup.log on the system. I will adapt the startup script to make sure stdout is kept around. Won’t be doing that until we can schedule downtime though.

Hi Timothee,

One of the bigger bottlenecks that I’ve encountered at my previous job is the way Openfire uses threads.

Openfire uses a couple of thread groups to do the major part of all of the work. By default, these groups are configured to be 17 threads in size.

From the top of my head, groups like these exist:

  • S2S connections;
  • Clients through NIO;
  • Clients through old style sockets;
  • External Components;

Most likely, your clients will either use NIO or old style sockets. This leaves 17 threads for (almost) all of the work that’s related to a client connection. If a client sends a request, one of these 17 threads will pick it up and stay busy until the request has been fully processed (this includes but is not limited to: parsing, authentication verification, privacy list checks, routing, pre-processing event listeners, processing and post-processing).

If anything locks up a thread for a while, you’re opening the possibility of running into issues just like the ones you’re describing: the more users you’ve got connected concurrently, the better the chance that a significant part of your 17 threads is locked up like that. If most or all of them are locked, you’ve got few or no threads left to do any actual processing. This will lead to very random problems, disconnections, timeouts and a lot of headaches (I’ve been there). Running Openfire in its clustered setup could actually make this problem worse, as threads tend to wait for an answer from another cluster node.

There’s a couple of things that you can do:

First, make the thread usage visible. The pools are implemented using an executor service, which isn’t hard to gather statistics from (I think there’s even a plugin flying around that does this by printing statistics into a file). Make sure that you monitor usage of your thread groups over time (all the time, preferably). As a quick and dirty solution, you can have Java thread dumps created at the time you’re noticing problems. Look at threads named ‘client-#’, which are the most likely candidates for giving you problems.
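As a quick sketch of what I mean by making the thread usage visible (this is not the plugin I mentioned, just an illustration; adjust the ‘client-’ prefix to whatever your own dumps show):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Counts how many of the pooled client threads are currently runnable,
// versus the total number of threads with that name prefix.
public class ClientThreadMonitor {

    public static void report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = threads.getThreadInfo(threads.getAllThreadIds(), 0);
        int total = 0;
        int runnable = 0;
        for (ThreadInfo info : infos) {
            if (info == null || !info.getThreadName().startsWith("client-")) {
                continue;
            }
            total++;
            if (info.getThreadState() == Thread.State.RUNNABLE) {
                runnable++;
            }
        }
        // Route this to wherever you keep your monitoring output.
        System.out.println("client threads: " + runnable + " runnable / " + total + " total");
    }
}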

If you can verify that it’s threading issues you’re suffering from, you should be able to identify your current bottleneck by looking at these thread dumps. This should help you fix your immediate problems. What I’ve experienced, though, is that a new bottleneck will pop up right after you’ve fixed the previous one.

The first step towards a somewhat more long-term solution is externalizing all functionality that you can. I’ve found it easiest to do this by implementing most functionality as Components (Components can usually be rewritten as external components). Most important is to release these ‘client threads’ (from the thread pools of 17) as soon as possible. I had rather good results using a queuing mechanism, where each component gets its own queue and its own executor service. The client threads simply hand off the workload by placing the to-be-processed packet in the component queue.
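A bare-bones sketch of that hand-off pattern (pool size, queue bounds and packet ordering are all things you will want to think about for your own traffic):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.xmpp.component.Component;
import org.xmpp.component.ComponentManager;
import org.xmpp.packet.JID;
import org.xmpp.packet.Packet;

// processPacket() only enqueues; a small private pool does the real work,
// so the shared client threads are released immediately.
public abstract class QueuedComponent implements Component {

    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    protected ComponentManager componentManager;

    public void initialize(JID jid, ComponentManager componentManager) {
        this.componentManager = componentManager;
    }

    public void processPacket(final Packet packet) {
        // Return to the calling client thread right away; the packet is
        // processed later on one of this component's own threads.
        workers.submit(new Runnable() {
            public void run() {
                handle(packet);
            }
        });
    }

    /** Does the actual work, e.g. building a reply and sending it back. */
    protected abstract void handle(Packet packet);

    public void start() {
    }

    public void shutdown() {
        workers.shutdown();
    }
}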

Good luck and happy hunting!

We think the worker threads are fine. They all hover between 2 and 10% CPU. What we are seeing is the main thread - the very top level in the process tree - spinning at 100% once we reach high load.

We are implementing our custom functionality with plugins; the primary one is indeed a Component. But maybe you are referring to networked component servers? Someone else suggested this, and that’s pretty compelling, although I’m not sure it changes the fundamentals of those calls being queued up and completely synchronous.

Threads do not necessarily have to be utilizing CPU while they’re in an “in-use” state. On the contrary: I’ve found that such locked threads are usually waiting for another event to happen (and, if all other threads are also ‘in-use’, there’s a small chance of that event ever happening…). Simple ‘sleep()’ calls are also present in some parts of the code.

I’ve found that the ‘17/17’ syndrome (17 threads in use out of 17 available threads) is a very good indicator of the state of your system. It will give you a far better ‘red flag indicator’ than monitoring CPU or memory alone: these threads are a direct indicator of how much processing power is available to Openfire (where CPU and memory are indirect indicators). Furthermore, monitoring the thread state will allow you to detect when things start to go wrong: watching CPU will only warn you when things have already gone wrong.

I strongly suggest you take the time to look into your thread pool state. Besides, it’s an easy job: you can wrap up a monitoring plugin (or re-use the available one, mentioned earlier in this topic) in a couple of hours. In the meantime, start firing off kill -3 signals whenever your system is performing poorly again. The resulting thread dumps will most likely give you a good idea where to start looking.