Issue in hazelcast plugin initialization

Harshad_D · August 30, 2013, 4:19pm

We believe that there is an issue in the hazelcast plugin initialization.

The normal flow:

During plugin initialization, in the constructor of ClusterListener, the plugin creates various caches.
It also adds EntryListeners to these caches.
On certain cache events, these EntryListeners populate ‘nodeSessions’ in the ClusterListener, which maintains information about the various caches for each node of the cluster.
When a member leaves the cluster, another member assumes the ‘senior’ position and removes the departing node’s footprint by calling ‘cleanupNode()’ for that node.
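
To make the bookkeeping above concrete, here is a minimal toy model of it. It is only a sketch: NodeSessionTracker and its methods are hypothetical names chosen for illustration, plain maps stand in for the clustered caches, and none of this is the plugin's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: class and method names are hypothetical, not the plugin's API.
final class NodeSessionTracker {
    // Stands in for a clustered cache: session key -> id of the owning node.
    private final Map<String, String> sessionCache = new HashMap<>();
    // Bookkeeping filled by the listener: node id -> session keys it owns.
    private final Map<String, Set<String>> nodeSessions = new HashMap<>();

    // A cache write; the cache then notifies its entry listener.
    void put(String sessionKey, String nodeId) {
        sessionCache.put(sessionKey, nodeId);
        onEntryAdded(sessionKey, nodeId);
    }

    // Listener callback: remember which node the new entry belongs to.
    private void onEntryAdded(String sessionKey, String nodeId) {
        nodeSessions.computeIfAbsent(nodeId, id -> new HashSet<>()).add(sessionKey);
    }

    // Run by the senior member when nodeId leaves; returns entries removed.
    int cleanupNode(String nodeId) {
        Set<String> keys = nodeSessions.remove(nodeId);
        if (keys == null) {
            return 0;
        }
        for (String key : keys) {
            sessionCache.remove(key);
        }
        return keys.size();
    }

    int cachedSessions() {
        return sessionCache.size();
    }
}
```

The point of the model is that cleanupNode() can only remove what the listener recorded earlier, so the cleanup depends entirely on the listener having been attached.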

The issue:

During initialization of the plugin, in the constructor of the ClusterListener, the plugin creates various caches.
Note that at this point the cache migration (to the ClusteredCache) has not been completed yet. The cache migration is completed within joinCluster(), during the processing of the event EventType.joined_cluster inside ClusterManager.
Due to this, the entry listeners are not being added to the caches in the constructor of the ClusterListener.
As a result, the ‘nodeSessions’ data structure is not populated when events occur on the caches.
Then, when a member leaves, its sessions are not cleaned up correctly by the next ‘senior’ member.
This creates zombie sessions (with pending subscriptions) that never get cleaned up.
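
The sequence above can be reproduced in a small toy model. This is a sketch under stated assumptions, not the plugin's code: CacheWrapper, migrateToClustered() and the rule that a listener can only be attached once the cache is clustered are all hypothetical simplifications of the behaviour described in this post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

// Toy model of the failure; all names here are hypothetical.
final class LostListenerSketch {
    // A cache that starts out local and only accepts listeners once clustered.
    static final class CacheWrapper {
        private boolean clustered = false;
        private final Map<String, String> entries = new HashMap<>();
        private final List<BiConsumer<String, String>> listeners = new ArrayList<>();

        // Returns false (and does nothing) while the cache is still local.
        boolean addEntryListener(BiConsumer<String, String> listener) {
            if (!clustered) {
                return false;
            }
            listeners.add(listener);
            return true;
        }

        void migrateToClustered() { clustered = true; }

        void put(String key, String nodeId) {
            entries.put(key, nodeId);
            for (BiConsumer<String, String> l : listeners) {
                l.accept(key, nodeId);
            }
        }

        void remove(String key) { entries.remove(key); }

        int size() { return entries.size(); }
    }

    // Returns how many of the departed node's sessions survive the cleanup.
    static int zombiesAfterNodeLeaves() {
        CacheWrapper cache = new CacheWrapper();
        Map<String, Set<String>> nodeSessions = new HashMap<>();

        // "Constructor" phase: the cache is not clustered yet, so this is skipped.
        cache.addEntryListener((key, node) ->
                nodeSessions.computeIfAbsent(node, n -> new HashSet<>()).add(key));

        // Join phase: the cache is migrated, but nobody re-adds the listener.
        cache.migrateToClustered();
        cache.put("session-1", "nodeA");
        cache.put("session-2", "nodeA");

        // nodeA leaves: cleanup only knows what nodeSessions recorded (nothing).
        for (String key : nodeSessions.getOrDefault("nodeA", Collections.emptySet())) {
            cache.remove(key);
        }
        return cache.size();
    }
}
```

In this model both of nodeA's sessions are still in the cache after the cleanup, because nodeSessions was never populated.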

The solution:

Whenever we migrate caches to ClusteredCache, we need to add entry listeners to them. Currently, cache migration happens during handling of the joined-cluster event. I have attached a patch file that is a possible solution to this problem. I did some testing after this change, and I can see that node data is getting cleaned up correctly during a rolling restart of the cluster.
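
For readers without the attachment, the idea can be sketched in a toy model. This is not the patch itself: the class names mirror the ones used in this post, but onJoinedCluster(), track() and the simplified ClusteredCache below are hypothetical stand-ins, and the only point is that listeners are attached after the migration instead of in the constructor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

// Toy model of the proposed ordering; not the plugin's real classes.
final class ReattachOnJoinSketch {
    // Simplified stand-in for a cache that has already been migrated.
    static final class ClusteredCache {
        private final Map<String, String> entries = new HashMap<>();
        private final List<BiConsumer<String, String>> listeners = new ArrayList<>();

        void addEntryListener(BiConsumer<String, String> listener) {
            listeners.add(listener);
        }

        void put(String key, String nodeId) {
            entries.put(key, nodeId);
            for (BiConsumer<String, String> l : listeners) {
                l.accept(key, nodeId);
            }
        }

        void remove(String key) { entries.remove(key); }

        int size() { return entries.size(); }
    }

    static final class ClusterListener {
        private final Map<String, Set<String>> nodeSessions = new HashMap<>();
        private ClusteredCache cache; // unknown until the node has joined

        // Hypothetical hook: called while handling the joined-cluster event,
        // after the cache has been migrated, so the listener can be attached.
        void onJoinedCluster(ClusteredCache migratedCache) {
            cache = migratedCache;
            cache.addEntryListener(this::track);
        }

        private void track(String key, String nodeId) {
            nodeSessions.computeIfAbsent(nodeId, n -> new HashSet<>()).add(key);
        }

        // Removes the departed node's entries; returns how many were removed.
        int cleanupNode(String nodeId) {
            Set<String> keys = nodeSessions.remove(nodeId);
            if (keys == null || cache == null) {
                return 0;
            }
            for (String key : keys) {
                cache.remove(key);
            }
            return keys.size();
        }
    }
}
```

With this ordering, entries written after the join are recorded per node, so a later cleanupNode() call for a departed node has something to work from.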

Let me know if you need more information.

Thanks!

Tom_Evans1 · September 18, 2013, 12:34am

Hi Harshad -

Thanks for posting this patch and providing the detailed documentation. I will take a look at your patch and merge this into the plugin for the next update. Refer to OF-699 for status updates.

Tom

luke8 · February 28, 2014, 8:31pm

Thanks Harshad D! This patch fixed our issue and has saved us a lot of troubleshooting time. We were experiencing login issues when the Master server would switch to the Slave server and then back to Master.