OpenFire Cluster unable to recover from nodes crashing

Salvador1 · August 23, 2016, 10:00am

Setup:

2 nodes in a cluster using the Hazelcast + AWS connector (OpenJDK 8u101, OpenFire 4.0.2, Hazelcast 2.2.0)
2 desktop clients connected to different nodes

Steps to reproduce:

Send a message from client A (connected to node A) to client B (connected to node B)
Client B receives the message
Send a SIGTERM to the OpenFire process running on node A
Restart OpenFire on node A
Reconnect client A
Send a message from client A (connected to node A) to client B (connected to node B)
Client B receives the message
Send a SIGTERM to the OpenFire process running on node B
Restart OpenFire on node B
Reconnect client B
Send a message from client A (connected to node A) to client B (connected to node B)

Results:

The message never arrives at client B

Logs from node B: [Java] openfire bug - Pastebin.com

Notes:

This Hazelcast issue (Null value on compute remove · Issue #7020 · hazelcast/hazelcast · GitHub) seems similar.

I tried the same test using OpenJDK 8-b132, but the bug still reproduces.

I tried the same test using OpenJDK 9-b132, but OpenFire doesn’t boot.

guus · August 23, 2016, 10:09am

Thanks for the detailed report. I’ve registered this in our issue tracker as https://issues.igniterealtime.org/browse/OF-1178

guus · August 23, 2016, 10:10am

I’ve created a different issue for the Java 9 issue: https://issues.igniterealtime.org/browse/OF-1179

Salvador1 · August 23, 2016, 10:40am

To be more precise: with Java 9, OpenFire starts but Hazelcast doesn’t, and my desktop client is not able to connect to the server.

Openfire logs: [Java] openfire + hazelcast + java9 - Pastebin.com
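
A quick way to tell "OpenFire is up but Hazelcast is not" apart from a node that is down entirely is to probe the relevant TCP ports. The following is only a minimal Python sketch and is not taken from this thread. It assumes the XMPP client port is the standard 5222 and that Hazelcast listens on its default port 5701; adjust both if your configuration differs. The host name in the example is a placeholder.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_node(host, ports=None, timeout=3.0):
    """Return a dict mapping each label to True/False for the probed ports."""
    if ports is None:
        # Assumed defaults: 5222 = XMPP client connections, 5701 = Hazelcast.
        ports = {"xmpp-client": 5222, "hazelcast": 5701}
    return {name: port_open(host, port, timeout) for name, port in ports.items()}


# Example (hypothetical host name):
# print(probe_node("node-a.example.com"))
```

If the XMPP port answers while the Hazelcast port does not, that would be consistent with the clustering plugin failing to start, though the server logs remain the authoritative source.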

Nathan_Neulinger · August 30, 2016, 8:58pm

I’d be interested in seeing if your results mirror mine: try adding a 40-60 second delay between the SIGTERM and the restart. I had a lot of issues with rolling restarts until I slowed the process down significantly and waited for the Hazelcast timeouts.

https://community.igniterealtime.org/thread/59205
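
The delayed restart suggested above could be scripted roughly as follows. This is a sketch under stated assumptions rather than a tested procedure: the PID and the start command are hypothetical and depend on how OpenFire is installed, and the 60 second default is simply the upper end of the 40-60 second range mentioned in the post.

```python
import os
import signal
import subprocess
import time


def restart_with_delay(pid, start_cmd, delay_seconds=60,
                       kill=os.kill, sleep=time.sleep, run=subprocess.run):
    """Send SIGTERM to pid, wait delay_seconds, then run start_cmd.

    kill, sleep and run are injectable so the sequence can be dry-run.
    """
    kill(pid, signal.SIGTERM)          # ask the OpenFire process to shut down
    sleep(delay_seconds)               # let the other node reach its Hazelcast timeouts
    return run(start_cmd, check=True)  # start OpenFire again; raises if the command fails


# Example (hypothetical PID and start command):
# restart_with_delay(12345, ["systemctl", "start", "openfire"], delay_seconds=60)
```

The sketch does not confirm that the old process has actually exited before the delay starts; a more careful version would wait for that first.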