We’ve recently finished load testing our Openfire server, and after some JVM tweaking we’ve come up with some really good results and JVM findings that might be useful to share.
We load tested using our own live data from the previous chat server (custom, non-XMPP): 2,600 players across 53 rooms during our busiest period in a single day. We multiplied this data to simulate extra load, up to 26,000 players across 530 rooms, to meet the key performance indicators the business had set. Note that as we increased the load we also increased the room count, so the maximum in any one room was 427 players at any point.
The good news was that Openfire stayed up! But only with a little JVM tweaking. We logged the time each message was sent and the time it was received by all clients, and graphed the latency. Overall the performance was very good, except for some massive latency spikes where clients wouldn’t receive messages for up to 5 or so seconds. These spikes tallied with full GCs: the higher the load, the longer the full GC lasted, and no messages were delivered during that time.
The operations guys here did a sterling job investigating various JVM options, and by the end had come up with tuning that let Openfire on a 64-bit JVM perform perfectly for our needs with 26,000 concurrent users. The below is copied directly from the email sent by the operations team; hope you find it useful:
- Instrumentation. It has been invaluable to be able to see the behaviour of the various garbage collectors we have tried. I suggest that the same interface be exposed on the pre-prod and production Openfire servers. To enable this, we added the following options to OPENFIRE_OPTS in /etc/sysconfig/openfire.
-Dcom.sun.management.jmxremote.port=8005 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
This access can be secured in production, if need be, with SSL and with password authentication.
This has allowed us to use the jstat and jconsole tools to attach to the JVM and inspect all kinds of useful information, including the sizes of the various memory generations and the default values of -XX-type JVM options. I would imagine it would be easy enough to integrate with Cacti as well.
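For reference, these are the kinds of commands this setup enables; the hostname and PID below are placeholders, not values from our environment:

```shell
# Attach jconsole to the remote JMX port configured above
# (replace chat-test01 with your server's hostname).
jconsole chat-test01:8005

# On the server itself, sample heap-generation utilisation and
# GC counts/times every second; <pid> is the Openfire JVM's pid.
jstat -gcutil <pid> 1000
```

jstat’s -gcutil output is particularly handy for spotting full GCs as they happen, since the FGC/FGCT columns tick up in real time.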
- Large Pages. Just about all of the literature I could find on the subject advocates using large pages (referred to as ‘HugePages’ in Linux terminology) with server-side Java. This can reduce the overhead imposed on the OS and the JVM in managing hundreds of thousands of memory pages. Additionally, large pages are ‘pinned’ in physical RAM, removing any impact from paging to disk in low-memory situations. I’ve successfully configured HugePages on our test server. I’ll email the kernel configuration details separately. The JVM is configured to use large pages by adding the following to OPENFIRE_OPTS.
-XX:+UseLargePages
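The kernel details were sent separately, but for completeness, a typical HugePages setup on CentOS looks something like the following. The page count and group id here are illustrative, not our actual values; size the pool to cover the JVM heap:

```shell
# Reserve 1024 x 2 MB HugePages (2 GB); choose a count large
# enough to cover the JVM heap.
echo "vm.nr_hugepages = 1024" >> /etc/sysctl.conf
# Allow the group that runs Openfire (gid 505 is illustrative)
# to allocate HugePages-backed shared memory.
echo "vm.hugetlb_shm_group = 505" >> /etc/sysctl.conf
sysctl -p

# Raise the memlock limit so the JVM can pin the pages
# (add to /etc/security/limits.conf):
#   openfire soft memlock unlimited
#   openfire hard memlock unlimited
```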
- Initial testing with the default garbage collector showed that significant ‘stop the world’ GC events were happening. These were taking significant amounts of time (3-4 seconds), during which no other activity in the JVM was taking place. Obviously, during such periods no chat messages would be sent. We troubleshot this using extended GC logging with the following switches. I would recommend using these in live and pre-prod, since the overhead of this logging is low and the information presented can be useful.
-Xloggc:/opt/openfire/logs/gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps
We were able to eliminate these pauses by switching to the Concurrent Mark Sweep collector for the old generation, which can be used in association with the parallel new-generation collector. Initial testing yielded some instability, but we found that tuning the new generation size in relation to the total heap size eliminated this.
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewRatio=2
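Putting all of the pieces discussed above together, the OPENFIRE_OPTS line in /etc/sysconfig/openfire ends up looking something like this. The 2 GB heap size is illustrative (the email didn’t state exact -Xms/-Xmx values; the box has 4 GB of RAM), and -XX:+UseLargePages is the standard HotSpot flag for the large-pages setting described earlier:

```shell
# /etc/sysconfig/openfire - assembled from the settings above.
# Heap sizes are illustrative; all other flags are as discussed.
OPENFIRE_OPTS="-Xms2g -Xmx2g \
  -XX:+UseLargePages \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewRatio=2 \
  -Xloggc:/opt/openfire/logs/gc.log -XX:+PrintGCDetails \
  -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps \
  -Dcom.sun.management.jmxremote.port=8005 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```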
The test ‘server’ is a single Core 2 Duo with 4GB of RAM running 64-bit CentOS, JDK version 1.6.0_20 (x86-64).
If anyone wants some more details or to see the test results, just PM me…