We believe there is an issue in the Hazelcast plugin initialization.
The normal flow:
- During plugin initialization, in the constructor of ClusterListener, the plugin creates various caches.
- It also adds EntryListeners to these caches.
- Upon certain events on the caches, these EntryListeners populate ‘nodeSessions’ in the ClusterListener, which maintains information about the various caches for each node of the cluster.
- When a member leaves the cluster, another member assumes the ‘senior’ position and removes the departing node’s footprint by calling ‘cleanupNode()’ for that node.
What actually happens:
- During plugin initialization, in the constructor of ClusterListener, the plugin creates various caches.
- Note that at this point the cache migration (to the ClusteredCache) has not been completed yet. The migration is completed within joinCluster(), during processing of the EventType.joined_cluster event inside ClusterManager.
- Because of this, the entry listeners are not added to the caches in the constructor of the ClusterListener.
- As a result, the ‘nodeSessions’ data structure is not populated when events occur on the caches.
- Then, when a member leaves, its sessions are not cleaned up correctly by the next ‘senior’ member.
- This creates zombie sessions (with pending subscriptions) that never get cleaned up.
Whenever we migrate caches to ClusteredCache, we need to add entry listeners to them. Currently, cache migration happens while handling the joined cluster event. I have attached a patch file with a possible solution to this problem. I did some testing after this change, and I can see that node data is cleaned up correctly during a rolling restart of the cluster.
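To illustrate the ordering problem outside of the plugin, here is a minimal toy model. It is not the attached patch and does not use the Openfire or Hazelcast APIs: ToyCache, ToyClusterListener, joinedCluster() and the boolean flag are hypothetical names made up for this sketch, and it assumes that registering a listener on a cache that is not yet clustered has no effect.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a cache; NOT the Openfire or Hazelcast API.
class ToyCache {
    private final boolean clustered;
    private final List<BiConsumer<String, String>> listeners = new ArrayList<>();

    ToyCache(boolean clustered) {
        this.clustered = clustered;
    }

    // Assumption for this sketch: listeners only attach to a clustered cache.
    boolean addEntryListener(BiConsumer<String, String> listener) {
        if (!clustered) {
            return false;
        }
        listeners.add(listener);
        return true;
    }

    // Notifies every registered listener about the new entry.
    void put(String sessionKey, String nodeId) {
        for (BiConsumer<String, String> listener : listeners) {
            listener.accept(sessionKey, nodeId);
        }
    }
}

// Hypothetical stand-in for the plugin's ClusterListener.
class ToyClusterListener {
    final Map<String, Set<String>> nodeSessions = new HashMap<>();
    private ToyCache cache = new ToyCache(false); // local cache before joining

    ToyClusterListener() {
        // Mirrors the reported problem: registration happens before migration.
        registerListener();
    }

    private boolean registerListener() {
        return cache.addEntryListener((sessionKey, nodeId) ->
                nodeSessions.computeIfAbsent(nodeId, id -> new HashSet<>()).add(sessionKey));
    }

    // Stand-in for handling the joined cluster event, where migration completes.
    void joinedCluster(boolean addListenersAfterMigration) {
        cache = new ToyCache(true); // cache is now "clustered"
        if (addListenersAfterMigration) {
            registerListener(); // the proposed idea: register after migrating
        }
    }

    void put(String sessionKey, String nodeId) {
        cache.put(sessionKey, nodeId);
    }

    // Returns the sessions that could be cleaned up for the departing node.
    Set<String> cleanupNode(String nodeId) {
        Set<String> removed = nodeSessions.remove(nodeId);
        return removed == null ? new HashSet<>() : removed;
    }
}
```

With joinedCluster(false), cleanupNode() finds nothing to remove for a node even though sessions were added for it, which corresponds to the zombie sessions described above. With joinedCluster(true), the same session is tracked and returned by cleanupNode().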
Let me know if you need more information.