Hazelcast cluster bottleneck on JMeter tests

Hi all,

We have been running some performance tests on Openfire using JMeter and the XMPP Protocol Support plugin (by BlazeMeter), with some interesting results, finding there is a considerable bottleneck using Hazelcast clustering plugin, let me share the info in case you have any ideas:

Infrastructure:
2 Openfire Machines
6 Cores processors
7 GB RAM

Test data:
4000 users
100 roster each
ramp-up: 30 seconds
connect timeout: 10 seconds

Test steps:

  1. Connect to server
  2. Login with user
  3. Set Presence
  4. Get roster

Meaning that in 30 seconds 4000 users must connect to the cluster, set their presence and retrieve their roster.

The tests run smoothly without clustering on each node separately, also runs OK in cluster but only with one node connected, the problem arises with both nodes clustered when failed transaction rate goes to the roof. These are the graphics of successful vs failed transactions:

One node with no Cluster: Failed rate 1.3%

One node alone with Cluster: Failed rate 3% (still acceptable)

Two nodes with Cluster: Failed rate 65%

The interesting part about the clustering test comes when we checked what is causing the failing: it is the “3. Set presence” step, for some reason that one is causing the huge failing.

Without it the fail rate drops to 0%

We have been trying to tweak hazelcast cache config and garbace collection settings here and there with no success, if you guys have any idea it will greatly help us to understand.

Best Regards!

What exactly does a “failed transaction” mean?

Greg

It means that one of the steps failed, more specific to this test it means that for one user “step 1. Connect to server” failed and hence the remaining steps also failed.

OK, but we really need to know a bit more than “A client attempted to login and failed”. Ideally client and server logs - they’re too big to post here, but somewhere like https://gist.github.com/ would be good.

Greg

You’re right, this is the all.log from the server side:

And this is the exception returned in JMeter:

Hope this helps, thanks in advance for your feedback

Looks like you’re tripping up over https://issues.igniterealtime.org/browse/OF-793

Not sure why it’s more likely to occur in a cluster, but probably a timing issue or similar.

Greg

I see… so it is an issue with Openfire and no Hazelcast right?

It seems a quick fix doesn’t exists for now, besides taking the cluster down, or lower the user load.

Anyhow thanks again for your help, and if you have any idea to improve or minimize the damage, it will be well received.

Best regards Greg!

Well, as you’re only testing, you could try repeating it with SSL disabled, see what sort of results that gives you. Clearly not a longer term solution, but would at least give you some results,

Greg

Sure, thank you!

I will do some tests and hope for the bug fix in the long term :stuck_out_tongue: