How many connections can Openfire handle?

I’m trying to test how many simultaneous connections Openfire can handle. I have found a limit at 4,000 users. Anything beyond that and the response times (for login) start taking several minutes.

I’m testing Openfire 3.6.4 (I’ll have to switch to 3.7.1 soon) with no connection manager (using the tsung framework). No SSL either (yet). The server is an Amazon “machine” with two Intel Xeon E5507 @ 2.27 GHz cores and 6 GB RAM. It is a standard Openfire install; the only thing I changed was increasing the maximum heap size to 4 GB.

When testing with up to 4,000 users, tsung reports a rate of about 50 requests per second being processed and a fairly constant processing time of 2.5 seconds. If I raise the user limit to 5,000, there’s a linear increase in response time from 2 seconds all the way up to 3 minutes 30 seconds. For the 5,000-user test I cut the incoming rate down to 25 requests/second. Things look a bit better if I further cut it down to 10 requests/second, but that already looks dangerously low (to me at least).

I would like to know if 4000 is a typical limitation with Openfire or (more likely) if there are settings that I should look into that can improve upon this number. Or maybe we just have to throw more hardware at it?

Quick update: it seems I was able to connect 5,000 users now, so disregard that particular point for now.

I’d still like to know what the typical limit is and whether there are specific settings that affect it. I have seen there’s a user cache, but it’s set at over 500,000 entries by default, so I guess at least that part is good enough. I have found the page listing all the Openfire settings, but it’s huge; I could use a few pointers.

Edit: While I was able to connect 5,000 users, 6,000 still seems to be a bridge too far.

This thread reports 250K concurrent users, and it is dated 2008.

http://community.igniterealtime.org/thread/31981

If you want to test for concurrency, you will get much more scalability by not using the embedded database. If you are using it, then your server is doing double duty as an XMPP server and a database server, and using more memory to boot.

Hehe, I know about reports of users having many more connections. The trouble is I started testing and got nowhere near the reported limits. I also came across connection managers, but found out they’re no longer required, as Openfire is supposed to handle large numbers of connections by itself now (though we’re aiming for TLS, so we might still need connection managers after all). Using PostgreSQL, installed on the same machine, btw.

In case you’re wondering why I need to find that limit: our company is using Openfire and we’re supposed to start contributing back to the project eventually. But first, we need to determine suitability, and that 4,000 is pretty far off the requirements.

Fair enough, certainly no one can fault you for doing some due diligence.

I have never needed to test its limits in those respects before, so I don’t think there is much I can say to help out.

You might get more feedback and suggestions if you provide more details of your tests. For example, are there rosters involved? That would have the side effect of more presence notifications.

What authentication provider are you using?

What is the rate of new connections?

What type of requests are you making to judge your response times?

Is the database running on the same machine?

You will probably need to do some profiling to figure out where your bottlenecks are, and potentially change some properties to get some performance gains. For instance, it is quite possible that you need more DB connections (pure speculation on my part).

Ok, here goes.

My tests are using tsung (I’ll attach a sample if I find out how to do it). I do 4 transactions. First is connect. Then authenticate. Then comes presence:initial (with ack set to global, so that all users have to reach this point before disconnecting). And last, close.
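
In tsung terms, the skeleton of that scenario looks roughly like the sketch below. Host names, domain, counts and rates are placeholders rather than my real values, and the element names are the ones from the stock tsung jabber examples, so double-check them against the tsung manual.

    <?xml version="1.0"?>
    <!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
    <!-- Rough sketch only; hosts, domain, counts and rates are placeholders -->
    <tsung loglevel="notice">
      <clients>
        <client host="localhost" use_controller_vm="true" maxusers="10000"/>
      </clients>
      <servers>
        <server host="openfire.example.com" port="5222" type="tcp"/>
      </servers>
      <load>
        <!-- incoming connection rate; 50/s in the runs described above -->
        <arrivalphase phase="1" duration="10" unit="minute">
          <users arrivalrate="50" unit="second"/>
        </arrivalphase>
      </load>
      <options>
        <option type="ts_jabber" name="domain" value="example.com"/>
        <option type="ts_jabber" name="username" value="user"/>
        <option type="ts_jabber" name="passwd" value="pass"/>
      </options>
      <sessions>
        <session probability="100" name="login-presence" type="ts_jabber">
          <transaction name="connect">
            <request><jabber type="connect" ack="local"/></request>
          </transaction>
          <transaction name="authenticate">
            <!-- plain (non-SASL) authentication -->
            <request><jabber type="auth_get" ack="local"/></request>
            <request><jabber type="auth_set_plain" ack="local"/></request>
          </transaction>
          <transaction name="initial_presence">
            <!-- ack="global": every simulated user must reach this point before anyone disconnects -->
            <request><jabber type="presence:initial" ack="global"/></request>
          </transaction>
          <thinktime value="30"/>
          <transaction name="close">
            <request><jabber type="close" ack="local"/></request>
          </transaction>
        </session>
      </sessions>
    </tsung>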

Rosters do not seem to matter much. I have separate tests with users having no rosters at all, and tests with users having 100, 250, 500 and 1,000 users in their rosters. Up to 4,000 users, all of them respond just fine - users with larger rosters seem to be using a bit more memory, nothing unexpected there.

Authentication is set to plain.

For tests that ran OK, I saw that the server was processing about 50 authentications per second. So I set the incoming rate to 50/s. To go above the 4,000 limit, I lowered that to 25 or even 10 incoming users per second, but that made little difference.

The transaction time I was watching was for authentication only (for more than 4,000 users, the disconnect time also goes down the drain, but that’s not so critical atm).

The DB is indeed running on the same machine.

I have already tried profiling, but if I set cpu=times the whole server slows to a crawl; cpu=samples, on the other hand, tells me which methods were called most frequently, not which used the most CPU cycles. I was hoping somebody more experienced might already know of possible hotspots to watch for.
connections-roster0000att.xml (2189 Bytes)

Alex wrote:

I’ll attach a sample if I find out how to do it

You can attach files using the advanced editor (top right corner of the editor).

To (partially) answer my own question: I seem to have reached 60,000 users, but the performance is horrible. If I include the initial presence broadcast in the test, the test takes forever. Stay tuned.

Edit: after playing around with some cache sizes, the performance is back to normal. So 60,000 is doable, time to aim higher.

If you are running Openfire 3.6.4, I suggest you disable PEP. It has a memory leak and could affect your testing.

Thanks, but I’ve upgraded to 3.7.1 in the meantime. I haven’t noticed leaks while testing with 3.6.4; memory always stayed within the same range. But maybe I wasn’t stressing it enough.

It is unlikely that your tests are actually using PEP.

Do not assume I do not have an entire battery of tests lined up just for pubsub

It’s just with only 4,000 users, there may not have been much stress going on.

Then PEP will almost certainly give you grief when you get there.

Pubsub, well, that depends on whether you are using persistent nodes or not. Persistent nodes will leak memory prior to the unreleased version 3.7.2.

I would be very curious as to other potential issues though, such as large numbers of subscribers. There are definitely some bottlenecks there as well, but I would like to see what the current performance is like.

Good luck!

Ok, the answer seems to be: at least 200,000 on an Amazon large server.

You have to tweak OF’s caches, and you only get acceptable performance when all users are cached - meaning that if the server crashes and many users try to reconnect, you’re basically screwed. The DB must also be up to speed; in my case tweaking PostgreSQL helped boost performance.

Fwiw, the default configuration was only good for 4,000 users; it may be a good idea to offer more sensible defaults out of the box.

Alex, thanks for providing this information. I am sure many of us will find this very useful.

I have marked your question as answered. Could we ask a big favour please?

Could you, at any spare moment, document the tweaks you carried out to get from 4,000 to 200,000? This would help to change the default values baked into the code.

My cache settings are:

  • cache.username2roster.size: 12000000
  • cache.group.size: 40000000
  • cache.userCache.size: 35000000
  • cache.lastActivity.size: 1500000
  • cache.offlinePresence.size: 1500000

The values are not fine tuned, but they’re good enough so that cache culling happens every few minutes instead of tens or hundreds of times a second - that was really killing performance. I have only used users with no rosters. Once rosters come into play, additional caches might need adjusting (also the number of connections might take a nose-dive).

I don’t know if you want the default caches to be good enough for 200,000 users; that may waste a lot of memory for smaller installations. I think the nicest approach would be to document something like: “you need roughly X bytes for cache Y for each 10,000 users”. Then provide defaults good enough for 25,000, maybe 50,000 users, and make that clear in the documentation. I won’t have time to make those measurements and you probably won’t either, but I think you can derive something from the values above. Should any other caches need adjusting, I’ll try to post updates here. Alternatively, you could just document that admins should watch out for cache pruning and ensure it doesn’t happen too often. That would be the easy way out.

I have increased the number of DB connections from 25 to 100, but I can’t say whether that helped any as I haven’t tested the change in isolation.
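
For reference, that’s the connection pool setting in conf/openfire.xml, something like the excerpt below. The connection details are placeholders and I’m quoting the element names from memory, so verify against your own file.

    <!-- conf/openfire.xml (excerpt) - connection details below are placeholders -->
    <jive>
      <database>
        <defaultProvider>
          <driver>org.postgresql.Driver</driver>
          <serverURL>jdbc:postgresql://localhost:5432/openfire</serverURL>
          <username>openfire</username>
          <password>secret</password>
          <minConnections>5</minConnections>
          <!-- raised from the default 25; 100 in my case -->
          <maxConnections>100</maxConnections>
          <connectionTimeout>1.0</connectionTimeout>
        </defaultProvider>
      </database>
    </jive>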

It may also be worth documenting that cache sizes are measured in bytes. I assumed they were measured in entries, that was quite a head-scratcher.

Top marks to you, Alex. Thanks.

I forgot to mention that at some point I had to increase the heap size, but that’s a pretty common trick and quite obvious when you have to do it.

Alex,

I have grouped everyone in the organization into their respective departments so that colleagues can find others by going to the displayed department. So there are about 80 groups in every user’s roster.

I have been adjusting the cache figures and the maxLifetime setting but can’t seem to get it right. When many users log in during peak time, the server seems to respond slowly, although the logins do eventually succeed.

Can you share the maxLifetime cache setting, or any others that you think would have an impact as well?

Thanks for the good postings.

I have yet to test users actually having contacts. I have used the default lifetime, which is ‘never expire’ I believe.

But the catch is users have to be cached beforehand. When users are not cached and there are many incoming connections, the performance is poor. I had authentication requests that took over an hour to complete (this was a stress test, normal usage shouldn’t reach that high).
