Inconsistent Roster Cache with Hazelcast Clustering enabled


Openfire caches rosters so that it does not have to query them from the database on each use. When the Hazelcast clustering plugin is enabled, this is a clustered cache that returns clones of the rosters.

Updating rosters in PresenceSubscribeHandler does not lock on the cache or on the roster. This leads to various race conditions when Hazelcast clustering is enabled. Example:

  • user1 subscribes to two other users (user2, user3).

  • user2 and user3 accept the subscription.

  • the "subscribed" messages of these users arrive concurrently at Openfire.

  • the rosters of the senders (user2, user3) get updated successfully.

  • now two different threads (thread1 for user2, thread2 for user3), or even two different cluster nodes, invoke getRoster() in order to update the roster of user1.

  • because Hazelcast clustering is enabled, the two threads get different copies of the roster.

  • thread1 updates the roster item of user2 inside the roster of user1

  • thread2 updates the roster item of user3 inside the roster of user1

  • thread2 wins because it writes last and overwrites the changes of thread1, so the update to the user2 item in the roster of user1 is lost.
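
To make the sequence above concrete, here is a minimal sketch in plain Java. It does not use any Openfire or Hazelcast classes: CloningCache, lockFor() and updateWithLock() are hypothetical names made up for illustration, and a roster is reduced to a map from contact to subscription state. The only assumption carried over from the real setup is that the cache hands out copies on get() and stores copies on put(). In a real cluster the lock would have to be cluster-wide rather than a local ReentrantLock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class RosterCacheDemo {

    /** Hypothetical stand-in for a clustered cache: get() and put() copy the value. */
    static class CloningCache {
        private final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();
        private final Map<String, Lock> locks = new ConcurrentHashMap<>();

        Map<String, String> get(String owner) {
            Map<String, String> roster = store.get(owner);
            // Always return a fresh copy, never the stored instance.
            return roster == null ? new HashMap<>() : new HashMap<>(roster);
        }

        void put(String owner, Map<String, String> roster) {
            store.put(owner, new HashMap<>(roster));
        }

        /** One lock per roster owner (local only; an assumption for this sketch). */
        Lock lockFor(String owner) {
            return locks.computeIfAbsent(owner, k -> new ReentrantLock());
        }
    }

    /** Replays the interleaving from the list above, without any locking. */
    static Map<String, String> unlockedInterleaving() {
        CloningCache cache = new CloningCache();
        Map<String, String> copy1 = cache.get("user1"); // thread1 reads
        Map<String, String> copy2 = cache.get("user1"); // thread2 reads a different copy
        copy1.put("user2", "subscribed");
        cache.put("user1", copy1);                      // thread1 writes back
        copy2.put("user3", "subscribed");
        cache.put("user1", copy2);                      // thread2 overwrites thread1's change
        return cache.get("user1");
    }

    /** Read-modify-write of one roster while holding that roster's lock. */
    static void updateWithLock(CloningCache cache, String owner, String contact) {
        Lock lock = cache.lockFor(owner);
        lock.lock();
        try {
            Map<String, String> roster = cache.get(owner);
            roster.put(contact, "subscribed");
            cache.put(owner, roster);
        } finally {
            lock.unlock();
        }
    }

    /** Runs both updates on real threads; the lock serializes them. */
    static Map<String, String> lockedUpdates() {
        CloningCache cache = new CloningCache();
        Thread t1 = new Thread(() -> updateWithLock(cache, "user1", "user2"));
        Thread t2 = new Thread(() -> updateWithLock(cache, "user1", "user3"));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cache.get("user1");
    }

    public static void main(String[] args) {
        System.out.println("without lock: " + unlockedInterleaving().keySet());
        System.out.println("with lock:    " + lockedUpdates().keySet());
    }
}
```

Without the lock, the roster of user1 ends up containing only user3. With the lock held around the whole get/modify/put sequence, both entries survive regardless of which thread runs first.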

I think this is only one example; there are surely other timing-dependent sequences. With clustering enabled, we also sometimes end up in a state where some roster entries are missing entirely.

Has anybody else run into this issue, or has someone already fixed it?

Thanks in advance