OpenFire Single-Server Load Testing: 120,000 Concurrent Connections

Update 12/18/2015: the original test was for 80,000 users online concurrently. The update below reaches 120,000, but with no rosters.

Openfire 3.10.3

(These servers are 6 years old)

OpenFire Server specs: Windows 2012 R2 Server, 64-bit Java, running OpenFire as a background service. Dual Intel Xeon L5420 CPUs @ 2.50GHz, 24GB RAM.

SQL Database Server: Windows 2008 R2 with SQL Express 2012 (free version of SQL Server), 8GB RAM, single Xeon E5420 CPU @ 2.50GHz.

Both servers use Samsung 850 120GB SSD hard drives. The SQL server has an additional 120GB SSD (D: drive) for the Openfire SQL database.

Neither server is currently using RAID; just single drives.

OpenFire Server:

NIC#1 WAN IP Address (3 Linux CentOS Tsung test servers connecting to this IP for the load testing)

NIC#2 LAN IP Address (Used to connect to a Windows 2008 R2 SQLExpress database for OpenFire via LAN in the same room)

The SQL server showed almost no CPU usage the entire time. It appeared to be mostly idle and handled all requests easily.

Test #1

5-15 roster "friends" for 40k random users. No rosters for the other 40k random users.

15 users logging in per second with initial presence and roster updates.

72,000 online = 32% CPU

I logged in with Spark on my own PC during the test; no delay, and the roster updated right away.
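For context, login rates like the 15 per second above are what Tsung calls an arrival rate. A minimal sketch of the load section of a Tsung scenario (the phase duration is an assumption):

```xml
<load>
  <!-- start 15 new simulated users per second for the whole phase -->
  <arrivalphase phase="1" duration="90" unit="minute">
    <users arrivalrate="15" unit="second"/>
  </arrivalphase>
</load>
```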

Test #2

5-15 roster "friends" for 40k random users. No rosters for the other 40k random users.

22.5 users logging in per second with initial presence and roster updates.

10k online = 5% CPU
20k online = 10% CPU (2GB RAM used for Java in Openfire)
30k online = 15% CPU (2GB RAM used for Java in Openfire)
40k online = 18% CPU (2GB RAM used for Java in Openfire)
79k online = 50% CPU (2GB RAM used for Java in Openfire)
80k online = 4% CPU after no more users logging in and no more roster/presence updates

I logged in with Spark on my own PC; no delay, and the roster updated right away.

Test #3

Same as above, just higher CPU usage.

5-15 roster "friends" for 40k users. No rosters for the other 40k.

30 users logging in per second with initial presence and roster updates.

The first run used just over 2GB of RAM for Java in Openfire; the second run used 4GB.

79k online = 70% CPU (2GB RAM used for Java in Openfire)

80k online = 4% CPU after no more users logging in and no more roster/presence updates

The second run of this test used 4GB. The Openfire cache was using the extra RAM, so I believe the previous tests caused that memory to be allocated and not released. After I stopped and restarted the Openfire service, the cache cleared and it ran with 2GB used again.

I logged in with Spark on my own PC; no delay, and the roster updated right away.

Settings: (screenshots in the original post)

CONCLUSION:

If you can manage how many client connections per second hit your server, even old hardware like mine will do just fine. I should easily be able to go past 150k or 200k, assuming I can control how quickly users try to reconnect. In my case I can, because all of my users run a custom Jabber client program I wrote. It is similar to TeamViewer and LogMeIn, and I use Openfire to handle the accounts, along with the awesome REST API, via my web server. Users sign up there, pay for my remote-control PC software, and choose a plan such as a 5-PC or 10-PC license. When they create their account, a PHP script uses REST to send the proper commands to the Openfire server, which creates and manages the users and their rosters. I also let people share their PCs with other users via the website, which again uses REST to modify the rosters of those accounts.
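For anyone curious, here is a minimal sketch of the kind of user-creation request the Openfire REST API plugin accepts (host, port, shared secret, and user values are all assumptions; the plugin authenticates with a secret key in the Authorization header):

```
POST /plugins/restapi/v1/users HTTP/1.1
Host: openfire.example.com:9090
Authorization: mySharedSecretKey
Content-Type: application/xml

<user>
  <username>customer123</username>
  <password>changeme</password>
  <name>Customer 123</name>
</user>
```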

I wanted to see if OpenFire had what it takes to do what I needed without having to purchase a bunch of servers frequently as the user base grew.

I have 4 of these servers and set up the cluster, which was also working, but I only had 3 additional servers with Linux CentOS and Tsung installed for the load testing. Each of those servers seemed to cap out at around 27,000 concurrent test users, which is why I could only get to around 80k in this test. The testing above was not on the cluster; it was on a single server only.

So if you had 4 of these servers ($150 on eBay) and an additional server for load balancing, such as one with Linux and HAProxy, you could handle a lot more requests per second.
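For anyone who wants to try that layout, a minimal sketch of an haproxy.cfg for balancing plain XMPP client traffic across Openfire nodes (IPs and server names are assumptions):

```
# haproxy.cfg sketch: balance raw XMPP (port 5222) across Openfire nodes
frontend xmpp_in
    bind *:5222
    mode tcp
    default_backend openfire_nodes

backend openfire_nodes
    mode tcp
    balance leastconn          # long-lived connections favor leastconn
    server of1 192.168.1.11:5222 check
    server of2 192.168.1.12:5222 check
    server of3 192.168.1.13:5222 check
    server of4 192.168.1.14:5222 check
```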

The other thing to note: I did a test with 150 users per second logging in, and I think by around 25k online the CPU usage was getting very high, hitting 100% soon after, if I recall correctly. This was on the exact same server mentioned above.

It is these connections per second that really put pressure on the server, and I'm sure even larger rosters would compound the issue and add more stress. If you could find a cheap 16-core or 24-core server that is also several years old, I think it would handle 150-200 connections per second in the test I did above. If I had one of these servers I would test it, but I do not.


Thank you for sharing. This is very interesting information.

I'm always a little dubious about what benchmarks are really showing, but these at least show that Openfire can sustain nearly 80k users. It'd be nice to know what the latency was like once the system had that many users on it, and what levels of activity it can withstand, though.

But most of all, is there any chance you could help out with this kind of testing on the 4.0 branch? We're moving to Beta today, and it'd be fantastic to get this kind of load testing done on it over the next couple of weeks as we move to the final release.

Please drop into the chatroom and introduce yourself - you'll be made very welcome.

Dave.

I was using Spark on my desktop PC with an account that had a few roster users. Periodically, while the server was being load tested, I would log in and out with the Spark client; it logged in very fast and updated the roster very quickly, just like you would want. Very fast.

I also tested when the CPU started hitting 100%, and that is when my Spark client had trouble logging in. I could not log in at all; it would time out. It seemed like as long as the CPU was not stressed, I could log in and out easily.

Heading out to lunch, but where is the chat room you speak of?

And yes, I would gladly benchmark 4.0. I can use the other 3 servers in the cluster with new hard drives running Linux/Tsung, so I should be able to do around 160k users for that test.

Our XMPP multi-user chatroom: open_chat@conference.igniterealtime.org

This is very interesting information.

UPDATE:

120,000 concurrent users on the same single machine mentioned above. These users had no rosters, but I can tell you from testing that what really stresses the system is how many connections per second are taking place. From 0 to 80,000 I was doing 30 logins per second, this time with 3 Tsung servers each logging in 10 users per second.

After 80,000 were online, CPU dropped to almost nothing. Then I put 2 more Tsung servers online (the original 3 holding 80k idle connections, the 2 new ones adding users), which meant 20 more logins per second. CPU usage went up immediately to over 60% and reached 90%, briefly touching 100% once I got to 117,000 users online. I let it keep going until 120,000 concurrent connections had been made; by then it was touching 100% CPU more frequently and stayed over 90% continuously.

I logged in with Spark multiple times throughout the entire test and was able to log in very fast, with the roster updating just as fast as if there were no stress on the system.

My original test above had considerable rosters, and even then I was able to log in fast without any issues, but that was with 80,000 users of which 40k had 5-15 roster entries each. This test was still with version 3.10.3.

UPDATE: Load tested a new server: Windows 2012 R2, 16GB RAM, SSD hard drive. Everything else, such as the SQL server, was the same.

New server specs: single Intel i7 4790K CPU @ 4.0GHz

Openfire's Java process used less than 2GB RAM the whole time. My cache settings were the same as above, although for some reason it showed:

Java Memory: xxx MB of 3572.00 MB. Java memory never used even 2GB; this might be because of my cache settings, but I am not sure.
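For reference, when Openfire runs as a Windows service the heap ceiling reported there is usually controlled by a vmoptions file next to the service executable. A minimal sketch, assuming the default bin\openfire-service.vmoptions location and a 2GB floor / 4GB ceiling:

```
-Xms2048m
-Xmx4096m
```

Each line is a single JVM argument; the service has to be restarted for the change to take effect.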

Test #1: 3 Tsung load-test servers logging in 30 per second (see the original test above, where I did 30 per second on the dual-Xeon server; note the CPU usage).

Results: 70k users online while still logging in 30 per second, just like the original test = 70k online at 30% CPU.

Test #2: 3 Tsung load-test servers logging in around 92 per second:

Results:

Online | CPU | logins/sec

29,874 | 33% | 94/sec
34,630 | 42% | 79/sec
40,057 | 46% | 90/sec
45,477 | 56% | 90/sec
51,000 | 74% | 92/sec
56,092 | 78% | 85/sec
61,455 | 85% | 89/sec
67,090 | 90% | 94/sec
72,858 | 95% | 96/sec

77,207 online users total; CPU went down to idle.

CONCLUSION: Modern hardware (we used the single-CPU i7) = WOW, VERY GOOD.

Just tested OpenFire beta 4.0.0

Same results as the post directly above, which was with version 3.10.3. CPU usage is the same; online users the same.

I just ran another test with OpenFire beta 4.0.0 and the same Intel i7 CPU mentioned previously.

This time I lowered the number of logins per second using the same 3 Tsung servers: an average of around 70 logging in per second.

I lowered the number of logins per second because I added 50-80 roster friends per user for 30,000+ users; each of those users has somewhere between 50 and 80 roster friends.

The first 30,000+ logged-in users all have these rosters. The other 50,000, which continued to log in at the same rate after them, did not have rosters.

Users online | CPU usage

5.5k | 10%
10k | 16%
14.5k | 21-30%
19k | 26-34%
23.5k | 33-38%
29k | 41%

After the 30,000+ users with rosters have all logged in, we can see that the CPU usage drops a good bit:

Users online | CPU | logins/sec

32k | 25%
36.8k | 28%
42.2k | 32%

33.2k | 27% | 76/sec
37.8k | 28% | 75/sec
42.3k | 34% | 64/sec
46.1k | 38% | 67/sec
50.2k | 39% | 67/sec
54.5k | 52% | 72/sec
59k | 51% | 75/sec
63k | 54% | 68/sec
67k | 63% | 64/sec
71.3k | 64% | 71/sec
75,762 | 73% | 74/sec
80k | 75% | 72/sec

Even though 50,000 users did not have a roster and 30,000 did have VERY LARGE rosters, I think it is safe to say that these results are quite impressive.

Users with 50-80 roster friends would probably not be realistic in the "real world", but I wanted to stress this server to see what it was capable of. Keep in mind, as I mentioned in the original post, that if you can lessen the number of users logging in per second, CPU usage comes down significantly.

In this test we were doing 70+ logins per second. If you had a load-balancing server such as one running HAProxy and 2 or 3 of these Intel i7 servers using the cluster plugin, you would be able to handle quite a lot of total connections as well as logins per second, which also means lots of messages being sent per second. The better the hardware, the better the results, as you can see from all of my testing.

Also, my roster cache, which is 479MB, was 87% used:

Roster: 479.15 MB total, 417.31 MB used (87.1%)

Java Memory: 1892.20 MB of 3572.00 MB (53.0%) used

Please redo with TLS and different key lengths (2048-bit, 4096-bit) in place.

Nobody uses unencrypted XMPP sessions nowadays.

Can you tell me how to configure Tsung to do so?

I'm no Tsung expert, but have a look at [TSUN-305] Can't connect with +TLS to ejabberd/XMPP - ProcessOne - Support.

Of course you have to set up SSL/TLS in Openfire first.
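One way around the STARTTLS trouble described in that ticket is to point Tsung at Openfire's legacy SSL port instead of negotiating STARTTLS on 5222. A minimal sketch of the servers section of the scenario (the hostname is an assumption):

```xml
<!-- connect over old-style SSL on 5223 rather than STARTTLS on 5222 -->
<servers>
  <server host="openfire.example.com" port="5223" type="ssl"/>
</servers>
```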

This is with the Intel i7 server again, but this time with SSL enabled:

Openfire 4.0.0 Beta

Online | CPU | logins/sec

3,064 | 15% | 74/sec
7,555 | 18% | 75/sec
12,053 | 28% | 75/sec (Java 1.6GB-3GB out of 8GB RAM)
16,475 | 34% | 74/sec
20,749 | 43% | 71/sec
25,145 | 48-55% | 73/sec
29,460 | 48-58% | 72/sec

Once again, the users with rosters have all logged on at this point (30,000+ users with 50-80 roster friends each, just like the previous test). We continue logging on, but the remaining users have no rosters.

33,767 | 38% | 72/sec
37,972 | 45% | 70/sec
42,245 | 50% | 71/sec
46,016 | 55% | 63/sec
50,044 | 58% | 67/sec
54,379 | 68% | 72/sec
58,512 | 74% | 68/sec
62,232 | 76% | 62/sec (Java 3.5GB-4.6GB)
65,789 | 83% | 59/sec (Java 3.7GB-5GB)
69,861 | 89% | 67/sec (Java 3.7GB-5GB)
74,169 | 95% | 72/sec (Java 3.7GB-5GB)

Logged in with Spark: no problem, and fast rosters/presence.

78,300 | 97%
Around 81k | now hitting 97-100% CPU

82k users online; everyone has logged in and no more are trying. CPU back down to idle, all users still online. (Java 4GB-5GB)

CONCLUSION:

We use more CPU power with SSL, but the results are still impressive considering I was logging in 70+ per second.

As I mentioned above, I am using a custom Jabber program for a remote-control PC platform. I can encrypt chat messages inside my program, so I could have the host and viewer PCs log in to Openfire without SSL, since auths are encrypted anyway. When I send the chat commands back and forth, the string itself is encrypted, so anyone sniffing the network would see the viewer PC username and the host PC username but would not be able to decipher the actual chat message (string).

This would let me use less CPU power if I needed to. I don't think I will need to, though, so everything should be fine.

I will be getting a hardware load balancer (hopefully 2 instead of 1) that can handle SSL and then forward the rest of the connection to the Openfire server. I will test this as soon as I get it and will report back here with my findings. From what I understand, at least with regular HTTP traffic, a load balancer like the one I am getting (it should have SSL offloading) makes the connection from your PC to the website secure via HTTPS, with the SSL certificate installed on the load balancer. The load balancer then acts as a proxy to the actual web server and connects to it over plain HTTP. This lets the web server handle many more requests with a lot less CPU power. The link from the load balancer to the web server is unencrypted, which is not an issue since they are on the same network; no man-in-the-middle attacks can happen there.
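Translated to XMPP, that offloading idea would look roughly like the sketch below (certificate path and IPs are assumptions). Note that this only works for direct/legacy SSL connections on 5223; STARTTLS on 5222 is negotiated inside the XMPP stream and cannot be terminated at the balancer this way.

```
# haproxy.cfg sketch: terminate TLS at the balancer, plain XMPP to Openfire
frontend xmpps_in
    bind *:5223 ssl crt /etc/haproxy/certs/xmpp.pem
    mode tcp
    default_backend openfire_plain

backend openfire_plain
    mode tcp
    server of1 192.168.1.11:5222 check
```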

I am going to test my Openfire server the same way as soon as the load balancer is delivered. As a side note, I tested the Hazelcast plugin with 4 nodes and HAProxy as a load balancer on another server, and it worked quite well, but I would prefer a 1U rack-mountable load balancer (in fact 2, for redundancy). HAProxy can't handle millions of concurrent connections; from what I've read it can do around 300k, and that requires a bit of tinkering and modifying. I was able to find 2 hardware load balancers that can handle 2 million concurrent connections each, made for this task right out of the box. Expensive brand new, but not so bad used.

I'll keep you all posted as soon as they come in and I get them set up.

I purchased a used server on eBay for $300. It has dual Xeon X5650 CPUs and 16GB RAM, Windows 2012 R2 just like the previous tests, and an SSD hard drive. These CPUs have 6 cores each.

I will say right from the start, this server's results are extremely impressive. When looking at CPU benchmarks online:

Intel i7 4790k = 11218

Single Xeon X5650 = 7589

Dual Xeon X5650 = 11743

With SSL connections:

With the i7 CPU test above, we had 29,460 online at 48-58% CPU with 72 logins per second.

With this dual-CPU server, we had 30,379 online at 32% CPU with 98 logins per second.

It is the logins per second that really stress the CPU, and large rosters for each user add even more stress. So we used a lot less CPU while doing 26 more logins per second. QUITE IMPRESSIVE, even though the benchmark scores were almost identical. It is highly likely that the larger cache that comes with Xeons, as well as the 6 cores per CPU, played a role in these much better results.

Did I mention this was a $300 server from ebay!!?? WOW.

With results like these, I don't care if my hardware load balancers can't do the SSL termination and hand off non-SSL as mentioned in my previous post. I'm still going to try, but if they can't, no big deal. By the way, I purchased 2 used load balancers for $110 each on eBay. They can run with redundancy, sharing a floating IP, so that if the master fails the other takes over without any problems.

Note: SSL is on and the first 30k users have 50-80 roster friends.

Tsung arrival rate:

Online | CPU | logins/sec

7,002 | 12% | 97/sec
12,898 | 16% | 98/sec
18,690 | 19% | 97/sec
24,579 | 22% | 98/sec
30,379 | 32% (dropping to 12% once the users with 50-80 roster friends had finished logging in; from here on, only users without rosters are logging in)

36,139 | 14% | 96/sec
41,898 | 15% | 96/sec
46,744 | 15% | 80/sec
51,743 | 16% | 83/sec
56,542 | 16% | 80/sec
60,900 | 17% | 73/sec
64,814 | 17% | 65/sec
68,909 | 20% | 68/sec

Tsung arrival rate:

Online | CPU | logins/sec

17,555 | 33-40% | 268/sec
28,368 | 42% | 180/sec

37,639 | 30% | 154/sec
46,480 | 38% | 147/sec
54,240 | 44% | 129/sec

Tsung arrival rate:

Online | CPU | logins/sec

25,199 | 60-70% | 352/sec
39,917 | 245/sec; started touching 100% CPU, then dropped to 81% several seconds later
50,052 | 33%
53,576 | 19%

It appears that 2 of the 3 Tsung servers quit logging in for some reason. Perhaps the Tsung 0.005 setting was just too fast.
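For reference, that 0.005 is Tsung's interarrival time: one new user every 5 ms, i.e. roughly 200 logins per second per Tsung node, or about 600 per second attempted across the 3 nodes. A sketch of the setting:

```xml
<!-- one new user every 0.005 s = ~200 logins/sec per Tsung node -->
<users interarrival="0.005" unit="second"/>
```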


Hi, I am trying 20K users on an Amazon EC2 Large server.

System Info: (screenshot in original post)

Openfire Info: (screenshot in original post)

10 user logins per second.

5k users : 10% CPU

10k users : 30% CPU

20k users : 65% CPU

Even after users stop logging in, the CPU remains at 60-70%.

I have tried many Java command-line settings and am still experimenting, but nothing seems to change this behavior.
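For what it's worth, the Java command-line settings usually tried for this kind of behavior are heap sizing and garbage-collector selection. A sketch of what such flags look like, placed one per line in the vmoptions file or passed on the java command line (the values are assumptions, not a tested recommendation for this workload):

```
# sketch: heap floor/ceiling plus the G1 collector (Java 7u4+)
-Xms2048m
-Xmx4096m
-XX:+UseG1GC
```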

I also tried another server with Windows 7, which takes 20% CPU for 20K users, but there too the CPU doesn't come down after the logins finish.

Looking for some expert advice.

Attached CPU graph after the 20K user log-ins finished: (screenshot in original post)

Hi Joe,

I read your post and I also tried to set up Tsung and conduct some performance tests. So far I'm not able to authenticate Tsung with Openfire; I get the error message SASLAuthentication - Client wants to do a MECH we don't support: 'ANONYMOUS'. Could you please tell me what type of authentication you use and how you generate users? Are they dynamic, or do you perhaps use anonymous authentication? Also, would it be possible for you to share your scenario *.xml files with us?