OpenFire Cluster unable to recover from nodes crashing

Setup:

  • 2 nodes in a cluster using the Hazelcast+AWS connector (OpenJDK 8u101, OpenFire 4.0.2, Hazelcast 2.2.0)

  • 2 desktop clients connected to different nodes

Steps to reproduce:

  • Send a message from client A (connected to node A) to client B (connected to node B)
  • Client B receives the message
  • Send a SIGTERM to the OpenFire process running on node A
  • Restart OpenFire on node A
  • Reconnect client A
  • Send a message from client A (connected to node A) to client B (connected to node B)
  • Client B receives the message
  • Send a SIGTERM to the OpenFire process running on node B
  • Restart OpenFire on node B
  • Reconnect client B
  • Send a message from client A (connected to node A) to client B (connected to node B)

Results:

The message never arrives at client B.

Logs from node B: [Java] openfire bug - Pastebin.com

Notes:

This Hazelcast issue (Null value on compute remove · Issue #7020 · hazelcast/hazelcast · GitHub) seems similar.

I ran the same test with OpenJDK 8-b132, and the bug still reproduces.

I ran the same test with OpenJDK 9-b132, but OpenFire doesn’t boot.

Thanks for the detailed report. I’ve registered this in our issue tracker as https://issues.igniterealtime.org/browse/OF-1178

I’ve created a different issue for the Java 9 issue: https://issues.igniterealtime.org/browse/OF-1179

To be more precise: OpenFire starts but Hazelcast doesn’t, and my desktop client is not able to connect to the server.

Openfire logs: [Java] openfire + hazelcast + java9 - Pastebin.com

I’d be interested to see whether your results mirror mine: try adding a 40-60 second delay between the SIGTERM and the restart. I had a lot of issues with rolling restarts until I slowed the process down significantly and waited for the Hazelcast timeouts.
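
For anyone who wants to script the slowed-down restart, here is a minimal sketch in Python. It is not taken from Openfire or Hazelcast: `rolling_restart`, `stop`, `start` and `settle_seconds` are hypothetical names, and the second pause (after the node comes back up) is my own assumption on top of the delay suggested above.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    nodes: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    settle_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart nodes one at a time, pausing after each stop and each start.

    `stop` and `start` are caller-supplied hooks (hypothetical); this
    function only controls the ordering and the waiting.
    """
    for node in nodes:
        stop(node)             # e.g. deliver SIGTERM to the OpenFire process
        sleep(settle_seconds)  # give the other members time to drop the node
        start(node)
        sleep(settle_seconds)  # assumption: let the node rejoin before moving on


if __name__ == "__main__":
    # Dry run: print the actions instead of touching real processes.
    rolling_restart(
        ["node-a", "node-b"],
        stop=lambda n: print(f"stop {n}"),
        start=lambda n: print(f"start {n}"),
        settle_seconds=0,
    )
```

In a real setup `stop` could wrap something like `os.kill(pid, signal.SIGTERM)` and `start` whatever launches OpenFire on that host; both depend on how the nodes are deployed, so they are left as hooks here.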

https://community.igniterealtime.org/thread/59205