Bitdefender Randomly Drops Clients

Michael42 · February 16, 2021, 6:29pm

At both sites I manage, all clients are on the same network at each particular site (same switch, router, etc.). One site uses a Cisco switch. The other site uses a Netgear switch. All computers are running the same OS (Windows 10 Pro). One site runs Spark version 2.9.4 and the other site runs a mixture of Spark 2.9.4 and 2.8.3.

I set up a test machine running Pidgin and will leave it powered on indefinitely to see whether it gets disconnected at any time.

Michael42 · February 16, 2021, 6:42pm

Hi Guus, both of my sites experience this disconnection problem. The two sites use different network equipment, and everything else has been running fine for years. Even Openfire and Spark had been running fine until several months ago, when I first noticed this happening at both sites.

Maybe I only noticed it once people started leaving their computers powered on all day/night so that they can connect to them remotely to work from home. So it’s possible that the issue has been around for a while.

But even so, like you said, if there was some sort of network blip, Spark should be able to overcome that and re-establish the connection. Maybe something changed in the connection handler logic?

All computers have “Sleep” and “Green Ethernet” disabled. So the NIC should continue to be active at all times.

I think extra debug logging in Spark would be very helpful and would tell us whether Spark is exiting for whatever reason on its own or if there’s something outside of Spark that’s causing it to abort.
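Until Spark itself logs more, one way to at least pin down when the process disappears is an external watchdog. Below is a minimal sketch, not anything shipped with Spark. It assumes the client shows up in the Windows process list as `Spark.exe` (an assumption, check Task Manager), and the log file name and the helper names `parse_tasklist`, `transition` and `watch` are made up for illustration. It polls `tasklist` and writes a timestamped line whenever the process starts or goes away.

```python
import csv
import io
import subprocess
import time
from datetime import datetime

PROCESS_NAME = "spark.exe"  # assumption: Spark's image name on Windows, lower-cased
LOG_PATH = "spark_watchdog.log"  # hypothetical log file name


def parse_tasklist(csv_text):
    """Return the set of lower-cased image names from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0].lower() for row in rows if row}


def transition(was_running, is_running):
    """Name the state change between two polls, or None if nothing changed."""
    if was_running and not is_running:
        return "exited"
    if is_running and not was_running:
        return "started"
    return None


def watch(interval=10.0):
    """Poll the process list forever and append a timestamped line on each change."""
    was_running = False
    while True:
        out = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        ).stdout
        is_running = PROCESS_NAME in parse_tasklist(out)
        change = transition(was_running, is_running)
        if change:
            with open(LOG_PATH, "a") as log:
                log.write(f"{datetime.now().isoformat()} {PROCESS_NAME} {change}\n")
        was_running = is_running
        time.sleep(interval)
```

Calling `watch()` on a workstation leaves a log whose "exited" timestamps can be lined up against the Openfire log and any other software running on the machine. It cannot say why the process ended, only when.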

Michael42 · February 16, 2021, 7:01pm

Thanks for your input. There are no issues with traffic routing or misconfigured network equipment. Switches are unmanaged. Cables are all tested & certified CAT6. Router configured by Cisco certified network engineer. Network connectivity has been tested and working perfectly. All applications work correctly - except Spark.

These drops only occur once in a couple of weeks and I only noticed it starting several months ago when people began leaving their computers powered on so that they can connect remotely. So even if there was some network blip, Spark should be able to recover and reconnect - not abruptly exit.

So if it was due to some network problem, then please explain:

Why is it happening at 2 different sites with different equipment?
Why does Spark crash or exit without even writing to its log files? Windows event logs also show no entries.

Just try this yourself:

Go to any computer running Spark and unplug its Ethernet cable. Does Spark crash or exit? No, it does not. It will try to reconnect to its server but remain running on that workstation.

This is why I’m not going down the Wireshark rabbit hole.

guus · February 17, 2021, 9:13am

Has anyone experimented with reproducing this problem, by causing network interruptions? I’m thinking about powering off a switch, or pulling out a network cable, things like that.

Insta-update: I should’ve read Michael’s last sentence more closely.

neo.rbk · February 17, 2021, 10:11am

As I assumed, there could be two reasons:
Since the OF log is showing that message, it could be something closing the client session from the client end (as some of you noticed, Spark crashed).
OR
It could be something sitting in the middle between the client and the XMPP entity (this would only happen if all those clients are connected to the same LAN/router).

I think the issue is mostly related to the 1st reason: “something closing the client session from the client end, as you noticed the crashing issue with Spark”.

I tried to reproduce the issue by using an old router which continuously drops packets, and also tried unplugging the network cable. Spark did not crash and successfully reconnected once I plugged the network cable back in.
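For anyone repeating this kind of test, it can help to record exactly when the server was unreachable from the workstation, so that outages can be compared with what the client did. Here is a minimal sketch under stated assumptions: the server listens on the default XMPP client port 5222, and the function names and log file name are invented for illustration. It only checks that a TCP connection can be opened; it does not speak XMPP.

```python
import socket
import time
from datetime import datetime

XMPP_PORT = 5222  # default XMPP client-to-server port; adjust if yours differs


def can_connect(host, port=XMPP_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_forever(host, interval=30.0, log_path="xmpp_probe.log"):
    """Poll the server and append a timestamped line each time reachability changes."""
    last = None
    while True:
        ok = can_connect(host)
        if ok != last:
            state = "reachable" if ok else "UNREACHABLE"
            with open(log_path, "a") as log:
                log.write(f"{datetime.now().isoformat()} {host} {state}\n")
            last = ok
        time.sleep(interval)
```

Running `probe_forever("your-openfire-host")` (hostname is a placeholder) next to the client gives a simple timeline. If the client goes away while the log shows the server reachable the whole time, that points away from the network.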

You are right, Wireshark won’t help. I already tested that when I used a faulty router.

By the way, did you notice anything with other XMPP clients (e.g. Pidgin/Jitsi/Gajim)?

If the user can connect to OF from outside of the network, then ask them to install the same version of Spark on another system and see if the issue appears.

FYI: I tried Spark 2.9.2 & 2.9.4

guus · February 17, 2021, 3:18pm

There’s also a chance that it’s caused by XMPP traffic that’s broadcast, causing all recipient Spark clients to crash. It’d be interesting to reason about what data could be broadcast infrequently. Maybe something like a server broadcast, which potentially is automated for admin users?

speedy · February 17, 2021, 3:24pm

What AV are you using? Is the AV doing any mitm network inspection? Please try disabling AV if you can.

Michael42 · February 17, 2021, 11:47pm

Thanks for your efforts. Okay, so we agree it’s not a network issue. I installed Pidgin on a test machine and have it running 24/7 and monitoring it for the next several weeks. I also have a Spark 2.9.4 client I’m running offsite, but connected remotely to that same OF server.

Will let you know what happens from here on in…

Michael42 · February 17, 2021, 11:50pm

Except that this also happens when no users are actively using Spark - like at 1:30 in the morning when they’re all at home sleeping (hopefully). Before they leave for the day, they just “lock” their Windows computers and leave Spark running with presence showing as “Away”.

Michael42 · February 17, 2021, 11:53pm

Hi Speedy, in an effort to troubleshoot, I also uninstalled the AV software from 2 of the computers at both sites and will monitor them as well in the next several weeks (it can take that long before a disconnect occurs). I’ll keep you apprised. Thanks for your input.

Michael42 · March 3, 2021, 3:08am

Hi Speedy, see my update in the main post above. Thanks again for your help.

wroot · March 3, 2021, 6:34am

Discourse makes it hard to go back to the first messages (lots of scrolling), so I’m reposting your comment:

UPDATE 3/02/21:
I’ve been monitoring both sites and so far it seems that the random drops may be caused by the enterprise endpoint AV software. This is surprising since we’ve been using this software (Bitdefender GravityZone Business Security) for several years along with Spark/OF without issues.

I tracked it down to its “Advanced Anti-Exploit” module, which seems to be enabled by default and kills running processes. So to test this theory, I disabled the module on one site and left it enabled on the other. Every 3 or 4 days, the drops are occurring on the site where it’s enabled. The other site has no drops at all.

I believe this module was updated automatically a year ago without my knowledge which is also around the time when the drops started to occur. Anyway, I will continue to monitor for another 2 weeks or so just to verify my suspicions and report back. If it turns out that this was the culprit, I will correct the title of this post accordingly and ask the mods to move it to the Spark forum.

wroot · March 3, 2021, 6:34am

Nice catch by Speedy and you.

Michael42 · March 3, 2021, 10:13pm

Thanks wroot! I’ll check back within 2 weeks to confirm whether this was the problem all along.

neo.rbk · March 4, 2021, 12:11pm

Will wait for your feedback. In the meantime, if you care to share your experience with other XMPP clients (e.g. Pidgin), did you notice any issue or anything unexpected?

Michael42 · March 4, 2021, 5:44pm

Yes, it seems that only the Spark clients were dropped while the Pidgin clients were left alone.

Another thing I noticed while testing:

I re-enabled Stream Management on the Openfire server (v4.6.2)
Restarted the Openfire service
Restarted the Spark clients (v2.9.4)

When I go to “Client Sessions”, Stream Management shows as being “Disabled” on each Spark client. However, on the Pidgin clients, it shows correctly as being enabled. Has SM been permanently disabled on these versions?
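As a side note for anyone checking this outside the admin console: a server that offers Stream Management (XEP-0198) includes an `sm` element in the `urn:xmpp:sm:3` namespace in its stream features. The sketch below is only an illustration of that check. The sample stanza is hand-written, not captured from Openfire, and `advertises_sm` is a made-up helper name. It shows what the server offers, not whether a given client actually enables it, which appears to be what the Client Sessions page reflects.

```python
import xml.etree.ElementTree as ET

SM_NS = "urn:xmpp:sm:3"  # Stream Management namespace from XEP-0198


def advertises_sm(features_xml):
    """Return True if a <stream:features/> stanza contains an <sm/> child."""
    root = ET.fromstring(features_xml)
    return root.find(f"{{{SM_NS}}}sm") is not None


# Hand-written sample stanza for illustration, not captured from a real server.
SAMPLE = (
    "<stream:features xmlns:stream='http://etherx.jabber.org/streams'>"
    "<bind xmlns='urn:ietf:params:xml:ns:xmpp-bind'/>"
    "<sm xmlns='urn:xmpp:sm:3'/>"
    "</stream:features>"
)
```

So a server can advertise SM while a particular client never negotiates it, which would fit seeing “Enabled” for Pidgin and “Disabled” for Spark against the same server.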

wroot · March 4, 2021, 7:40pm

Yes, SM is disabled in Spark. See the comments on SPARK-2140 in the Ignite Realtime Jira.

Michael42 · March 5, 2021, 4:04am

Oh, I wasn’t aware of that. Hopefully, it will get fixed someday. Thanks for the update.

neo.rbk · March 5, 2021, 7:13am

It seems the problem is actually related to Spark.

Just out of curiosity, is anyone facing the same issue with Spark but using a different version of OF (not 4.6.1 or 4.6.2)?
I want to make sure whether the issue can be fixed from the client end (i.e. Spark) or whether it still needs some adjustment on the server (OF).

ilyaHlevnoy · March 5, 2021, 8:11am

Hey!
I’m using Openfire 4.5.4 with 200 Spark 2.9.4 users.
There are no problems, some users have uptime for a month or more.