Poor performance after clustering is enabled

I’ve been working on an Openfire clustering solution for my employer. Our plan was to run one server in each of our four datacenters. The problem arises when I enable clustering (Hazelcast): the admin interface becomes slow and unresponsive, presence updates take minutes, and chat sessions take several minutes to start. Users who have just logged in have to wait 3-7 minutes before their messages go through.

I’ve added the following to the Hazelcast configuration, but it doesn’t seem to change anything.

80

40

The only things of note in the logs are errors like this:

org.jivesoftware.openfire.handler.IQHandler - Internal server error

java.lang.IllegalArgumentException: IQ must be of type 'set' or 'get'. Original IQ: <iq type="result"

Any suggestions?

The Hazelcast clustering components are quite sensitive to network latency and perform best between nodes that are on the same LAN (or high-speed backbone). If you want to distribute your servers among multiple sites, your best bet will be to set up separate XMPP domains (or subdomains) and federate them via the S2S protocol. You can then use clustering within each site/domain to provide scaling and redundancy. I have done this with good results in multiple deployments.
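To get a rough idea of whether inter-site latency is the culprit, a small script along these lines can time TCP connections from one node to the others. This is only a sketch, not an Openfire or Hazelcast tool: it assumes Hazelcast is listening on its default member port (5701), the function names are made up for illustration, and connection time is only an approximation of the round-trip latency the cluster actually sees.

```python
import socket
import statistics
import sys
import time

# Assumption: Hazelcast's default member port. Change this if your
# Hazelcast configuration overrides the port.
HAZELCAST_PORT = 5701


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection opened; close it immediately
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host, port, samples=5):
    """Median of several connection times, to smooth out one-off spikes."""
    return statistics.median(tcp_connect_ms(host, port) for _ in range(samples))


if __name__ == "__main__":
    # Hosts to probe are passed on the command line.
    for node in sys.argv[1:]:
        try:
            print(f"{node}: {median_connect_ms(node, HAZELCAST_PORT):.1f} ms")
        except OSError as exc:
            print(f"{node}: unreachable ({exc})")
```

Run it from one cluster node with the other nodes' hostnames as arguments and compare the figures between sites with what you see between two machines on the same LAN.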

If you really want to tune Hazelcast to work over a WAN, you may find some additional information in the Hazelcast documentation, but several people have reported that a single cluster does not perform well across multiple sites.

Hope that helps.