Bitdefender Randomly Drops Clients

At both sites I manage, all clients are on the same network at each particular site (same switch, router, etc.). One site uses a Cisco switch. The other site uses a Netgear switch. All computers are running the same OS (Windows 10 Pro). One site runs Spark version 2.9.4 and the other site runs a mixture of Spark 2.9.4 and 2.8.3.

I set up a test machine running Pidgin and will leave it powered on indefinitely to see whether it gets disconnected at any time.

Hi Guus, both of my sites experience this disconnection problem. They both use different network equipment and everything else has been running fine for years. Even Openfire and Spark have been running fine up until several months ago when I first noticed this started happening at both sites.

Maybe I only noticed it once people started leaving their computers powered on all day/night so that they can connect to them remotely to work from home. So it’s possible that the issue has been around for a while.

But even so, like you said, if there was some sort of network blip, Spark should be able to overcome that and re-establish the connection. Maybe something changed in the connection handler logic?

All computers have “Sleep” and “Green Ethernet” disabled. So the NIC should continue to be active at all times.

I think extra debug logging in Spark would be very helpful and would tell us whether Spark is exiting for whatever reason on its own or if there’s something outside of Spark that’s causing it to abort.
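Until Spark itself logs more, an external watchdog can at least timestamp the moment the process disappears, so the time can be compared against AV or Windows logs. Below is a minimal sketch in Python, not anything that ships with Spark. It assumes the client shows up in Windows `tasklist` as `Spark.exe` (an assumption, check Task Manager for the real image name); `watch` and `is_running` are made-up helper names. It cannot tell you who ended the process, only when.

```python
import subprocess
import time
from datetime import datetime


def is_running(image_name):
    """Return True if Windows `tasklist` lists a process with this image name."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/NH"],
        capture_output=True,
        text=True,
    )
    return image_name.lower() in result.stdout.lower()


def watch(image_name, interval_s=30.0, max_polls=None, check=is_running, log=print):
    """Poll for the process and log a timestamped line whenever its state flips.

    max_polls=None means poll forever; `check` and `log` can be swapped out,
    which makes the loop easy to try without a real Spark client.
    """
    was_running = None
    polls = 0
    while max_polls is None or polls < max_polls:
        running = check(image_name)
        if was_running is not None and running != was_running:
            state = "started" if running else "EXITED"
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {image_name} {state}")
        was_running = running
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_s)
```

To use it, call `watch("Spark.exe")` from a console that stays open on the workstation (the image name is an assumption, as noted above). A logged `EXITED` time that lines up with an entry in another tool's log would point to something outside Spark ending the process.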

Thanks for your input. There are no issues with traffic routing or misconfigured network equipment. Switches are unmanaged. Cables are all tested & certified CAT6. Router configured by Cisco certified network engineer. Network connectivity has been tested and working perfectly. All applications work correctly - except Spark.

These drops only occur once in a couple of weeks and I only noticed it starting several months ago when people began leaving their computers powered on so that they can connect remotely. So even if there was some network blip, Spark should be able to recover and reconnect - not abruptly exit.

So if it was due to some network problem, then please explain:

  1. Why is it happening at 2 different sites with different equipment?

  2. Why does Spark crash or exit without even writing to its log files? Windows event logs also show no entries.

Just try this yourself:

Go to any computer running Spark and unplug its Ethernet cable. Does Spark crash or exit? No, it does not. It will try to reconnect to its server but remain running on that workstation.

This is why I’m not going down the Wireshark rabbit hole.

Has anyone experimented with reproducing this problem, by causing network interruptions? I’m thinking about powering off a switch, or pulling out a network cable, things like that.

Insta-update: I should’ve read Michael’s last sentence more closely. :roll_eyes:

As I see it, there could be two reasons:
Since the OF log is showing that message, something could be closing the client session from the client end (as some of you noticed, Spark crashed).
OR
Something sitting between the client and the XMPP entity is the cause (this would only happen if all those clients are connected to the same LAN/router).

I think the issue is mostly related to the 1st reason: “something closing the client session from the client end, as you noticed with the Spark crashes”.

I tried to reproduce the issue by using an old router which continuously drops packets and also tried unplugging the network cable. Spark did not crash and successfully reconnected itself once I plugged the network cable back in.

You are right, Wireshark won’t help. I already tested that when I used the faulty router.

By the way, did you notice anything with other XMPP clients (e.g. Pidgin/Jitsi/Gajim)?

If a user can connect to OF from outside of the network, then ask them to install the same version of Spark on another system and see if the issue appears.

FYI: I tried Spark 2.9.2 & 2.9.4
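For anyone repeating the cable-pull or faulty-router test, a small probe can record when the server's client port is actually unreachable, so those times can be set next to any Spark exits. This is only a sketch: `can_connect` is a made-up helper, the host name in the usage note is a placeholder, and 5222 is the standard XMPP client port, which your Openfire install may or may not use.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("openfire.example.local", 5222)` (placeholder host) run every minute from a scheduled task, with the result written to a file, gives a rough timeline of network outages. It only tests TCP reachability; it says nothing about the XMPP session itself.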

There’s also a chance that it’s caused by XMPP traffic that’s broadcast, causing all recipient Spark clients to crash. It’d be interesting to reason about what data could be broadcast infrequently. Maybe something like a server broadcast, which is potentially automated for admin users?

What AV are you using? Is the AV doing any MITM network inspection? Please try disabling the AV if you can.

Thanks for your efforts. Okay, so we agree it’s not a network issue. I installed Pidgin on a test machine and have it running 24/7 and monitoring it for the next several weeks. I also have a Spark 2.9.4 client I’m running offsite, but connected remotely to that same OF server.

Will let you know what happens from here on in…

Except that this also happens when no users are actively using Spark - like at 1:30 in the morning when they’re all at home sleeping (hopefully). Before they leave for the day, they just “lock” their Windows computers and leave Spark running with presence showing as “Away”.

Hi Speedy, in an effort to troubleshoot, I also uninstalled the AV software from 2 of the computers at both sites and will monitor them as well in the next several weeks (it can take that long before a disconnect occurs). I’ll keep you apprised. Thanks for your input.

Hi Speedy, see my update in the main post above. Thanks again for your help.

Discourse makes it hard to go back to the first messages (lots of scrolling), so I’m reposting your comment :wink:

UPDATE 3/02/21:
I’ve been monitoring both sites and so far it seems that the random drops may be caused by the enterprise endpoint AV software. This is surprising since we’ve been using this software (Bitdefender GravityZone Business Security) for several years along with Spark/OF without issues.

I tracked it down to its “Advanced Anti-Exploit” module, which seems to be enabled by default and kills running processes. So to test this theory, I disabled the module on one site and left it enabled on the other. Every 3 or 4 days, the drops are occurring on the site where it’s enabled. The other site has no drops at all.

I believe this module was updated automatically a year ago without my knowledge which is also around the time when the drops started to occur. Anyway, I will continue to monitor for another 2 weeks or so just to verify my suspicions and report back. If it turns out that this was the culprit, I will correct the title of this post accordingly and ask the mods to move it to the Spark forum.

Nice catch by Speedy and you.

Thanks wroot! I’ll check back within 2 weeks to confirm whether this was the problem all along.

Will wait for your feedback. In the meantime, if you care to share your experience with other XMPP clients (e.g. Pidgin), did you notice any issue or anything unexpected?

Yes, it seems that only the Spark clients were dropped while the Pidgin clients were left alone.

Another thing I noticed while testing:

  1. I re-enabled Stream Management on the Openfire server (v4.6.2)
  2. Restarted the Openfire service
  3. Restarted the Spark clients (v2.9.4)

When I go to “Client Sessions”, Stream Management shows as being “Disabled” on each Spark client. However, on the Pidgin clients, it shows correctly as being enabled. Has SM been permanently disabled on these versions?

Yes, SM is disabled in Spark. See comments here [SPARK-2140] - Ignite Realtime Jira

Oh, I wasn’t aware of that. Hopefully, it will get fixed someday. Thanks for the update.

It seems the problem is actually related to Spark.

Just out of curiosity, is anyone facing the same issue with Spark but using a different version of OF (not 4.6.1 or 4.6.2)?
I want to make sure whether the issue can be fixed from the client end (i.e. Spark) or whether it still needs some adjustment on the server (OF).

Hey!
I’m using Openfire 4.5.4 and 200 Spark 2.9.4 users.
There are no problems, some users have uptime for a month or more.