Hazelcast plugin failing to wait for timeout on cluster tasks (possible cause)

Hi! I’ve been dealing with the same problem as others, with Hazelcast reporting that it failed to execute cluster tasks within the set timeout period. Like others, I have also seen that the expected timeout had in fact not elapsed, and that the cluster nodes sometimes appear as unavailable on the clustering page of the admin console. Finally, I have also seen messages being lost and synchronization of clients connected to the servers occasionally failing, and I am assuming that this last part is just a symptom of the same problem.

A colleague of mine spotted this in the hazelcast plugin source code.

Class: ClusteredCacheFactory

public Collection<Object> doSynchronousClusterTask(ClusterTask task, boolean includeLocalMember) {
    if (cluster == null) { return Collections.emptyList(); }
    Set<Member> members = new HashSet<Member>();
    Member current = cluster.getLocalMember();
    for (Member member : cluster.getMembers()) {
        if (includeLocalMember || (!member.getUuid().equals(current.getUuid()))) {
            members.add(member);
        }
    }
    Collection<Object> result = new ArrayList<Object>();
    if (members.size() > 0) {
        // Asynchronously execute the task on the other cluster members
        try {
            logger.debug("Executing MultiTask: " + task.getClass().getName());
            Map<Member, Future<Object>> futures = hazelcast.getExecutorService(HAZELCAST_EXECUTOR_SERVICE_NAME)
                .submitToMembers(new CallableTask<Object>(task), members);
            long nanosLeft = TimeUnit.SECONDS.toNanos(MAX_CLUSTER_EXECUTION_TIME * members.size());
            for (Future<Object> future : futures.values()) {
                long start = System.nanoTime();
                result.add(future.get(nanosLeft, TimeUnit.NANOSECONDS));
                nanosLeft = (System.nanoTime() - start);
            }
        } catch (TimeoutException te) {
            logger.error("Failed to execute cluster task within " + MAX_CLUSTER_EXECUTION_TIME + " seconds", te);
        } catch (Exception e) {
            logger.error("Failed to execute cluster task", e);
        }
    } else {
        logger.warn("No cluster members selected for cluster task " + task.getClass().getName());
    }
    return result;
}

This is the implementation that several cluster tasks use to execute across the cluster.

Pay attention to the loop that collects the future results in the snippet above. The nanosLeft variable is supposed to keep track of how many nanoseconds remain until the timeout period expires, and that value is used as the timeout for each future.get() call. However, when nanosLeft is reassigned after each call, its current value is not considered at all: the new value is System.nanoTime() - start, which is simply how long the last future took to complete, not the budget that remains.

At this point, I understand that with multiple nodes some tasks may fail, and the more nodes there are, the higher the probability of these tasks failing.

My setup has two nodes, and the tasks sometimes fail and sometimes don’t. Given what I just described, I think the tasks succeed when the second member executes the task in less time than the first node did, and fail otherwise.
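The two-node scenario can be reproduced with plain java.util.concurrent futures. The sketch below is a hypothetical, self-contained illustration (not the plugin code): two simulated "members" run in a thread pool, and the buggy assignment makes the second future's timeout equal to however long the first future took, so the second call times out long before the intended 30-second budget is spent.

```java
import java.util.concurrent.*;

public class BuggyTimeoutDemo {
    // Reproduces the faulty accounting: after the first get(), nanosLeft
    // becomes the time the FIRST future took, not the remaining budget.
    static boolean prematureTimeout() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Simulate two cluster members: the first replies quickly, the second slowly.
            Future<String> fast = pool.submit(() -> { Thread.sleep(50); return "node1"; });
            Future<String> slow = pool.submit(() -> { Thread.sleep(500); return "node2"; });

            long nanosLeft = TimeUnit.SECONDS.toNanos(30); // intended overall budget
            long start = System.nanoTime();
            fast.get(nanosLeft, TimeUnit.NANOSECONDS);
            nanosLeft = (System.nanoTime() - start);       // the buggy assignment: ~50 ms

            slow.get(nanosLeft, TimeUnit.NANOSECONDS);     // only ~50 ms allowed here
            return false;                                  // second task made it in time
        } catch (TimeoutException te) {
            return true;  // timed out long before the 30-second budget elapsed
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(prematureTimeout() ? "premature timeout" : "completed");
        // prints "premature timeout"
    }
}
```

This matches the observed symptom: whenever the second member happens to respond slower than the first, the aggregate call fails even though the configured timeout never elapsed.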

My question for the Openfire community is: is the behavior I’m describing correct? Have we detected a bug in the Hazelcast clustering plugin?

Note that the Hazelcast clustering plugin was published with these changes about 7 months ago (according to GitHub), and if this were the whole story, I would expect the community to have been in an uproar over an implementation that fails 50% of the time. So I believe that I may be missing part of the picture.

As a proposed patch, we changed the faulty assignment to the following:

nanosLeft = nanosLeft - (System.nanoTime() - start);

This keeps track of the time remaining for the outstanding tasks, in line with our assumptions.
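The corrected pattern can be isolated into a small, self-contained helper. This is only a sketch of the deadline-decrementing technique using standard java.util.concurrent types (the method and class names here are our own, not the plugin's): each get() is charged against a single shared budget, so the total wait across all futures never exceeds the overall timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class DeadlineAwareCollect {
    // Waits for all futures within one overall deadline, subtracting the time
    // spent on each get() from the remaining budget (the proposed fix's logic).
    static <T> List<T> collectAll(List<Future<T>> futures, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        List<T> results = new ArrayList<>();
        long nanosLeft = unit.toNanos(timeout);
        for (Future<T> future : futures) {
            long start = System.nanoTime();
            results.add(future.get(nanosLeft, TimeUnit.NANOSECONDS));
            nanosLeft -= (System.nanoTime() - start); // subtract elapsed time, keep the rest
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Integer>> futures = new ArrayList<>();
        futures.add(pool.submit(() -> 1));
        futures.add(pool.submit(() -> 2));
        System.out.println(collectAll(futures, 5, TimeUnit.SECONDS)); // prints [1, 2]
        pool.shutdown();
    }
}
```

Note that if the budget runs out, nanosLeft goes to zero or negative and the next future.get() throws TimeoutException immediately, which is the desired behavior.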

Thank you for submitting this analysis and a proposed fix. We will take a closer look and respond shortly.

OK, I have confirmed this bug and opened [OF-943] Hazelcast cluster task(s) timeout prematurely to track.

By way of additional commentary, this bug would most severely affect clusters of three nodes or more, since most of the core synchronous cluster tasks currently exclude the local cluster node. For two-node clusters, the runtime stats collection task would be the most likely failure point. This is presumably why the Openfire admin console fails at times to report status correctly for non-local clustering nodes.

This was an excellent find. Thank you for submitting a PR for your proposed fix, we will incorporate this into the next update of the Hazelcast clustering plugin.

Thanks a lot!

So, you’re confirming that while this is an actual bug, it should have no impact on message delivery across the cluster. Is that correct?

Yes, that’s correct. For a two-node cluster, only the runtime cluster stats collection for the admin console would be affected by this bug. Below are the only two occurrences of a synchronous clustered task execution that includes the local/current node:

web/system-clustering.jsp (line 92)

org.jivesoftware.openfire.plugin.util.cluster.NodeRuntimeStats (line 57)

All other known occurrences of synchronous task execution within a cluster explicitly exclude the local node. While certainly incorrect (and highly annoying), this bug should not be the cause of any message loss or other functional failures in a two-node cluster. In a cluster having three or more nodes, the failure rate and impact would presumably be somewhat higher.

Hi Tom,

Thanks for the fix.

This issue also exists in the Openfire 3.9.3 Hazelcast plugin v1.2.2, which produces the exception below:

[Thread-Jetty-QTP-AdminConsole-106] ERROR - [com.jivesoftware.util.cache.ClusteredCacheFactory:doSynchronousClusterTask] - Failed to execute cluster task within 30 seconds

at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:466)
at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:314)
at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:334)
at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
at org.jivesoftware.openfire.admin.system_002dclustering_jsp._jspService(system_002dclustering_jsp.java:123)

Let us know whether the fix is available for the older version, or advise us on how this could be resolved.