Results from our Openfire load testing

Hi all,

We’ve recently finished load testing our Openfire server and, after some JVM tweaking, we’ve come up with some really good results and JVM findings that might be useful to share.

We load tested using our own live data from the previous chat server (custom, non-XMPP): 2600 players across 53 rooms during our busiest period in a single day. We multiplied this data to simulate extra load, up to 26000 players across 530 rooms, to meet the key business performance indicators we had been set. Note that as we increased the load we also increased the room count, so the maximum in any single room was 427 players at any point.

The good news was that Openfire stayed up, BUT only with a little bit of JVM tweaking. We logged the time each message was sent and the time it was received by all clients, and graphed the latency. We found that overall the performance was very good, except for some massive latency spikes where clients wouldn’t receive messages for up to 5 or so seconds. This tallied up with full GCs: the higher the load, the longer a full GC would last, and no messages would be delivered during that time.

The operations guys here did a sterling job investigating various JVM options, and by the end of it had come up with tuning that meant Openfire performed perfectly for our needs with 26000 concurrent users on a 64-bit JVM. The below is copied directly from the email sent by the operations team; hope you find it useful:

  1. Instrumentation. It has been invaluable to be able to see the behaviour of the various garbage collectors we have tried. I suggest that the same interface be exposed on the pre-prod and production openfire servers. To enable this, we added the following options to OPENFIRE_OPTS in /etc/sysconfig/openfire.

-Dcom.sun.management.jmxremote.port=8005 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

This access can be secured in production, if need be, with SSL and password authentication.
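
As a sketch, the locked-down variant would look something like the following (the password and access file paths here are placeholders rather than our actual setup):

-Dcom.sun.management.jmxremote.port=8005 -Dcom.sun.management.jmxremote.ssl=true -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/opt/openfire/conf/jmxremote.password -Dcom.sun.management.jmxremote.access.file=/opt/openfire/conf/jmxremote.access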

This has allowed us to use the jstat and jconsole tools to attach to the JVM and introspect all kinds of useful information, including the sizes of the various memory generations and the default values of -XX type JVM options. I would imagine that it would be easy enough to integrate with Cacti as well.
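
To give an idea of what this looks like in practice, these are the kinds of commands we point at the JVM (the host name below is a placeholder):

jstat -gcutil $(pgrep -f openfire) 5s   # per-generation heap occupancy, sampled every 5 seconds
jconsole chatserver.example.com:8005    # browse memory pools and the effective -XX option values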

  2. Large Pages. Just about all of the literature I could find on the subject advocates using large pages (referred to as ‘HugePages’ in Linux terminology) with server-side Java. This can reduce the overhead imposed on the OS and the JVM in managing hundreds of thousands of memory pages. Additionally, large pages are ‘pinned’ in physical RAM, removing any impact from paging to disk in low-memory situations. I’ve successfully configured HugePages on our test server. I’ll email the kernel configuration details separately. The JVM is configured to use large pages by adding the following to OPENFIRE_OPTS.

-XX:+UseLargePages
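
Since the kernel side went out in a separate email, here is a rough sketch of the settings involved (the page count, gid and heap size below are illustrative for a ~2GB heap with 2MB pages, not our exact values):

# /etc/sysctl.conf - reserve 1024 x 2MB huge pages (2GB) for the JVM heap
vm.nr_hugepages = 1024
# gid of the group openfire runs under (illustrative value)
vm.hugetlb_shm_group = 501
# shared memory segment large enough for the heap (bytes)
kernel.shmmax = 2147483648

# /etc/security/limits.conf - let the openfire user lock 2GB of RAM (values in KB)
openfire soft memlock 2097152
openfire hard memlock 2097152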

  3. Garbage collection. Initial testing with the default garbage collector showed that significant ‘stop the world’ GC events were happening. These were taking significant amounts of time (3-4 seconds) during which no other activity in the JVM was taking place. Obviously, during such periods, no chat messages would be sent. Troubleshooting of this activity was facilitated using extended GC logging with the following switches. I would recommend using these in live and pre-prod, since the overhead of this logging is low and the information presented can be useful.

-Xloggc:/opt/openfire/logs/gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps
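
Even a crude grep of that log is enough to spot the stop-the-world pauses (the path matches the -Xloggc setting above, and each matching line includes the pause time in seconds):

grep "Full GC" /opt/openfire/logs/gc.log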

We were able to eliminate these pauses by switching the old generation to the Concurrent Mark Sweep collector, which can be used in association with the Parallel New Generation collector. Initial testing yielded some instability, but we found that tuning the new generation size in relation to the total heap size eliminated this.

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewRatio=2
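
Putting it all together, the OPENFIRE_OPTS line in /etc/sysconfig/openfire ends up looking roughly like the following (the heap sizes shown are illustrative rather than our final figures):

OPENFIRE_OPTS="-Xms2048m -Xmx2048m -XX:NewRatio=2 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseLargePages -Xloggc:/opt/openfire/logs/gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps -Dcom.sun.management.jmxremote.port=8005 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"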

The test ‘server’ is a single Core 2 Duo with 4GB RAM running 64-bit CentOS, JDK version 1.6.0_20 (x86-64).

If anyone wants some more details or to see the test results just pm me…

Thank you so much for sharing this! Question: which version of Openfire did you test against? If it was 3.6.4, could you please test again against trunk, or when the 3.7.0 beta comes out really-soon-now?

daryl

This is a very interesting read! Could I persuade you to compose some sort of whitepaper, that digs into the details a bit further?

Could you elaborate a little on the Openfire features that your clients are using (groupchat = MUC?), what kind of contact lists they have, what your database setup is, what kind of hardware you’re using, and the like?

What’s the product for which you are using Openfire? (A shameless plug of your company is ok!) I’m very interested in reading up on innovative ways in which people use our product.

Hi Daryl,

Yep, against 3.6.4. We’re running 3.6.4 live and are expecting a big push in activity while we’re still on 3.6.4. But hopefully we’ll be on 3.7.0 in the next couple of months, and all the load testing infrastructure is now there to be able to run the tests. We’ll post our next findings against the 3.7.0 beta…

Hey Guus,

Ok - Openfire setup:

  • Content filter plugin for swear filtering

  • Custom auth provider

  • No rosters (yet!)

  • Focused entirely on MUC

  • MySQL database on the same server (had to tweak the flush interval - now 20 secs, batch size 500; see the sketch after this list) consuming very little memory

  • Hardware for the load test is a single Core 2 Duo with 4GB RAM running 64-bit CentOS, JDK version 1.6.0_20 (x86-64).
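
On the flush interval / batch size tweak above: assuming it maps onto Openfire’s MUC conversation-log flushing (the property names below are from memory rather than something to copy blindly, so check them against your admin console), the change would be along these lines:

xmpp.muc.tasks.log.timeout = 20000     # flush the MUC conversation log every 20 seconds
xmpp.muc.tasks.log.batchsize = 500     # write up to 500 queued messages per flush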

We’re a gaming company - Openfire powers a group chat feature that sits alongside the games to foster a user community.

We’ve (kind of) added multi-domain support, but it’s not an elegant solution, as we use a single Openfire instance for the many brands that have chat running on them.

One suggestion is to compare the results on FreeBSD, as Java is reputed to run significantly faster on that OS.