Hazelcast cluster bottleneck on JMeter tests

Rabbit · September 5, 2018, 3:00pm

Hi all,

We have been running some performance tests on Openfire using JMeter and the XMPP Protocol Support plugin (by BlazeMeter), with some interesting results: there is a considerable bottleneck when using the Hazelcast clustering plugin. Let me share the details in case you have any ideas:

Infrastructure:
2 Openfire Machines
6-core processors
7 GB RAM

Test data:
4000 users
100 roster entries each
ramp-up: 30 seconds
connect timeout: 10 seconds

Test steps:

1. Connect to server
2. Login with user
3. Set Presence
4. Get roster

This means that within 30 seconds, 4000 users must connect to the cluster, set their presence, and retrieve their roster.
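
To put that load in numbers, here is a rough back-of-the-envelope sketch in Python. The function names are only for illustration, and the fan-out figure assumes every presence update is delivered to every roster contact, which is an upper bound and not something we measured:

```python
# Rough load estimate derived from the test data above.
USERS = 4000           # test users
ROSTER_SIZE = 100      # roster entries per user
RAMP_UP_SECONDS = 30   # JMeter ramp-up period


def connection_rate(users, ramp_up_seconds):
    """Average number of new connections per second during ramp-up."""
    return users / ramp_up_seconds


def max_presence_fanout(users, roster_size):
    """Upper bound on presence deliveries.

    Assumption: each user's presence is sent to every roster contact.
    """
    return users * roster_size


print(f"{connection_rate(USERS, RAMP_UP_SECONDS):.1f} new connections per second")
print(f"up to {max_presence_fanout(USERS, ROSTER_SIZE)} presence deliveries")
# 133.3 new connections per second
# up to 400000 presence deliveries
```

So the cluster has to accept roughly 133 new connections per second on average, and the presence step can generate far more traffic than any other step in the test.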

The tests run smoothly without clustering on each node separately, and they also run OK with clustering enabled but only one node connected. The problem arises when both nodes are clustered: the failed transaction rate goes through the roof. These are the failure rates from the graphs of successful vs. failed transactions:

One node with no cluster: failure rate 1.3%

One node alone with cluster: failure rate 3% (still acceptable)

Two nodes with cluster: failure rate 65%

The interesting part of the clustering test came when we checked what is causing the failures: it is the “3. Set presence” step. For some reason, that step is responsible for the huge failure rate.

Without it, the failure rate drops to 0%.

We have been trying to tweak the Hazelcast cache config and garbage collection settings here and there with no success. If you have any ideas, they would greatly help us understand what is going on.

Best Regards!

gdt · September 5, 2018, 4:43pm

What exactly does a “failed transaction” mean?

Greg

Rabbit · September 5, 2018, 7:17pm

It means that one of the steps failed. More specifically, in this test it means that for a given user, “step 1. Connect to server” failed and hence the remaining steps also failed.

gdt · September 6, 2018, 9:16am

OK, but we really need to know a bit more than “A client attempted to login and failed”. Ideally client and server logs - they’re too big to post here, but somewhere like https://gist.github.com/ would be good.

Greg

Rabbit · September 7, 2018, 3:34pm

You’re right, this is the all.log from the server side:

https://gist.github.com/Camilo-G/15263358914e3e225d6b3f4f1ee6bf9b.js

all.log.java

2018.09.07 11:25:07 WARN  [socket_c2s-thread-12]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x0000EC0A: nio socket, server, null => 0.0.0.0/0.0.0.0:5222)
javax.net.ssl.SSLHandshakeException: SSL handshake failed.
        at org.apache.mina.filter.ssl.SslFilter.messageReceived(SslFilter.java:487)
        at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
        at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
        at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
        at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
        at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
        at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
        at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)

This file has been truncated.

And this is the exception returned in JMeter:

https://gist.github.com/Camilo-G/aa3b909f3b44384765c6c8d9850d7ef0.js

response.java

org.jivesoftware.smack.SmackException$NoResponseException
	at org.jivesoftware.smack.XMPPConnection.throwConnectionExceptionOrNoResponse(XMPPConnection.java:548)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.throwConnectionExceptionOrNoResponse(XMPPTCPConnection.java:867)
	at org.jivesoftware.smack.tcp.PacketReader.startup(PacketReader.java:113)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.initConnection(XMPPTCPConnection.java:482)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectUsingConfiguration(XMPPTCPConnection.java:440)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:811)
	at org.jivesoftware.smack.XMPPConnection.connect(XMPPConnection.java:396)
	at com.blazemeter.jmeter.xmpp.actions.Connect.perform(Connect.java:18)
	at com.blazemeter.jmeter.xmpp.JMeterXMPPSampler.sample(JMeterXMPPSampler.java:57)

This file has been truncated.

Hope this helps. Thanks in advance for your feedback.

gdt · September 7, 2018, 3:59pm

Looks like you’re tripping up over https://issues.igniterealtime.org/browse/OF-793

Not sure why it’s more likely to occur in a cluster, but probably a timing issue or similar.

Greg

Rabbit · September 7, 2018, 4:06pm

I see… so it is an issue with Openfire and not Hazelcast, right?

It seems a quick fix doesn’t exist for now, besides taking the cluster down or lowering the user load.

Anyhow, thanks again for your help. If you have any ideas to improve things or minimize the damage, they will be well received.

Best regards, Greg!

gdt · September 7, 2018, 4:07pm

Well, as you’re only testing, you could try repeating it with SSL disabled and see what sort of results that gives you. Clearly not a longer-term solution, but it would at least give you some results.

Greg

Rabbit · September 7, 2018, 4:15pm

Sure, thank you!

I will do some tests and hope for a bug fix in the long term.