At both sites I manage, all clients are on the same network at each particular site (same switch, router, etc.). One site uses a Cisco switch. The other site uses a Netgear switch. All computers are running the same OS (Windows 10 Pro). One site runs Spark version 2.9.4 and the other site runs a mixture of Spark 2.9.4 and 2.8.3.
I set up a test machine running Pidgin and will leave it powered on indefinitely to see whether it gets disconnected at any time.
Hi Guus, both of my sites experience this disconnection problem. The two sites use different network equipment, and everything else has been running fine for years. Even Openfire and Spark were running fine until several months ago, when I first noticed this happening at both sites.
Maybe I only noticed it once people started leaving their computers powered on all day and night so that they can connect remotely to work from home. So it's possible that the issue has been around for a while.
But even so, like you said, if there was some sort of network blip, Spark should be able to overcome that and re-establish the connection. Maybe something changed in the connection handler logic?
All computers have "Sleep" and "Green Ethernet" disabled. So the NIC should continue to be active at all times.
I think extra debug logging in Spark would be very helpful and would tell us whether Spark is exiting on its own for whatever reason or whether there's something outside of Spark that's causing it to abort.
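In the meantime, a small watcher running outside of Spark could at least timestamp when the process disappears, so that the time can be lined up against the Openfire and Windows logs. Below is a rough sketch, not anything that ships with Spark: it polls the Windows `tasklist` command and writes a line whenever the process appears or vanishes. The image name `Spark.exe`, the log file name and the polling interval are my assumptions, so adjust them to match your install.

```python
import csv
import io
import subprocess
import time
from datetime import datetime

IMAGE_NAME = "Spark.exe"  # assumption: image name of the Spark process


def parse_tasklist_csv(output, image_name):
    """Return the PIDs of rows in `tasklist /FO CSV /NH` output that match image_name."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # A matching row looks like: "Spark.exe","1234","Console","1","150,000 K"
        # The "INFO: No tasks are running..." message has one column and is skipped.
        if len(row) >= 2 and row[0].lower() == image_name.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def running_pids(image_name):
    """Ask Windows which processes currently have the given image name."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=False,
    )
    return parse_tasklist_csv(result.stdout, image_name)


def watch(image_name=IMAGE_NAME, interval_s=30.0, log_path="spark_watch.log"):
    """Poll forever and log a timestamped line each time the running state changes."""
    was_running = None
    while True:
        is_running = bool(running_pids(image_name))
        if is_running != was_running:
            stamp = datetime.now().isoformat(timespec="seconds")
            state = "RUNNING" if is_running else "NOT RUNNING"
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"{stamp} {image_name} {state}\n")
            was_running = is_running
        time.sleep(interval_s)
```

Calling `watch()` and leaving it running only produces output when the state changes, so the log stays short even over several weeks.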
Thanks for your input. There are no issues with traffic routing or misconfigured network equipment. The switches are unmanaged. The cables are all tested and certified CAT6. The router was configured by a Cisco-certified network engineer. Network connectivity has been tested and is working perfectly. All applications work correctly, except Spark.
These drops only occur once every couple of weeks, and I only noticed them starting several months ago, when people began leaving their computers powered on so that they can connect remotely. So even if there was some network blip, Spark should be able to recover and reconnect, not abruptly exit.
So if it was due to some network problem, then please explain why:
Is it happening on 2 different sites with different equipment?
Spark crashes or exits without even writing to its log files? Windows event logs also show no entries.
Just try this yourself:
Go to any computer running Spark and unplug its Ethernet cable. Does Spark crash or exit? No, it does not. It will try to reconnect to its server but remain running on that workstation.
This is why I'm not going down the Wireshark rabbit hole.
Has anyone experimented with reproducing this problem by causing network interruptions? I'm thinking about powering off a switch, or pulling out a network cable, things like that.
Insta-update: I should've read Michael's last sentence more closely.
As I see it, there could be two causes. Since the OF log shows that message, either something is closing the client session from the client end (as some of you noticed, Spark crashed), or something sitting between the client and the XMPP server is responsible (which would only apply if all of the affected clients are connected to the same LAN/router).
I think the issue is most likely the first cause: something is closing the client session from the client end, which matches the Spark crashes you noticed.
I tried to reproduce the issue by using an old router that continuously drops packets, and I also tried unplugging the network cable. Spark did not crash and successfully reconnected once I plugged the network cable back in.
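To correlate any future drop with the actual reachability of the server, a simple probe could run next to Spark on one of the affected machines. This is only a minimal sketch, assuming the Openfire server accepts client connections on the standard XMPP client port 5222. The host name `openfire.example.lan` in the note below and the log file name are placeholders, not values from this thread.

```python
import socket
import time
from datetime import datetime


def can_connect(host, port=5222, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def log_transitions(host, port=5222, interval_s=10.0, checks=None,
                    log_path="of_probe.log"):
    """Probe repeatedly and append a line whenever reachability changes.

    checks=None means probe until interrupted; an integer limits the number of probes.
    """
    last = None
    done = 0
    while checks is None or done < checks:
        up = can_connect(host, port)
        if up != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            state = "UP" if up else "DOWN"
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"{stamp} {host}:{port} {state}\n")
            last = up
        done += 1
        if checks is None or done < checks:
            time.sleep(interval_s)
```

For example, `log_transitions("openfire.example.lan")` would run until interrupted. If Spark exits at a time when this log shows no DOWN entry, that points away from the network and towards something on the workstation itself.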
You are right, Wireshark won't help. I already tested that when I used the faulty router.
By the way, did you notice anything with other XMPP clients (e.g. Pidgin, Jitsi, Gajim)?
If a user can connect to OF from outside the network, then ask them to install the same version of Spark on another system and see whether the issue appears.
There's also a chance that it's caused by XMPP traffic that's broadcast, causing all recipient Spark clients to crash. It'd be interesting to reason about what data could be broadcast infrequently. Maybe something like a server broadcast, which is potentially automated for admin users?
Thanks for your efforts. Okay, so we agree it's not a network issue. I installed Pidgin on a test machine, have it running 24/7, and will monitor it for the next several weeks. I also have a Spark 2.9.4 client that I'm running offsite, connected remotely to that same OF server.
Except that this also happens when no users are actively using Spark, like at 1:30 in the morning when they're all at home sleeping (hopefully). Before they leave for the day, they just "lock" their Windows computers and leave Spark running with presence showing as "Away".
Hi Speedy, in an effort to troubleshoot, I also uninstalled the AV software from 2 of the computers at both sites and will monitor them as well over the next several weeks (it can take that long before a disconnect occurs). I'll keep you apprised. Thanks for your input.
Discourse makes it hard to go back to the first messages (lots of scrolling), so I'm reposting your comment:
UPDATE 3/02/21:
I've been monitoring both sites and so far it seems that the random drops may be caused by the enterprise endpoint AV software. This is surprising since we've been using this software (Bitdefender GravityZone Business Security) for several years along with Spark/OF without issues.
I tracked it down to its "Advanced Anti-Exploit" module, which seems to be enabled by default and kills running processes. To test this theory, I disabled the module on one site and left it enabled on the other. Every 3 or 4 days, the drops are occurring on the site where it's enabled. The other site has no drops at all.
I believe this module was updated automatically a year ago without my knowledge which is also around the time when the drops started to occur. Anyway, I will continue to monitor for another 2 weeks or so just to verify my suspicions and report back. If it turns out that this was the culprit, I will correct the title of this post accordingly and ask the mods to move it to the Spark forum.
I will wait for your feedback. In the meantime, if you care to share your experience with other XMPP clients (e.g. Pidgin): did you notice any issue or anything unexpected?
Yes, it seems that only the Spark clients were dropped while the Pidgin clients were left alone.
Another thing I noticed while testing:
I re-enabled Stream Management on the Openfire server (v4.6.2)
Restarted the Openfire service
Restarted the Spark clients (v2.9.4)
When I go to "Client Sessions", Stream Management shows as being "Disabled" on each Spark client. However, on the Pidgin clients, it shows correctly as being enabled. Has SM been permanently disabled on these versions?
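As far as I understand XEP-0198, Stream Management is negotiated per session: the server advertises an `<sm/>` element in the `urn:xmpp:sm:3` namespace in its stream features, and the client then has to send `<enable/>` itself. To rule out the server side, the sketch below checks whether a captured `<stream:features>` element (for example, copied from a client's XML debug output) advertises SM. The sample XML is illustrative only and was not captured from my server, and the function name is my own.

```python
import xml.etree.ElementTree as ET

SM_NS = "urn:xmpp:sm:3"  # Stream Management namespace defined by XEP-0198

# Illustrative sample, NOT captured from a real server.
SAMPLE_FEATURES = (
    "<stream:features xmlns:stream='http://etherx.jabber.org/streams'>"
    "<bind xmlns='urn:ietf:params:xml:ns:xmpp-bind'/>"
    "<sm xmlns='urn:xmpp:sm:3'/>"
    "</stream:features>"
)


def advertises_sm(features_xml):
    """Return True if the stream features element has a direct <sm/> child in SM_NS."""
    root = ET.fromstring(features_xml)
    return root.find(f"{{{SM_NS}}}sm") is not None


print(advertises_sm(SAMPLE_FEATURES))
```

The pasted element needs to include the `xmlns:stream` declaration, otherwise the parser rejects the `stream:` prefix. If the server does advertise SM and a session still shows as disabled, the next thing to look at would be whether that client ever sends `<enable/>`.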
Just out of curiosity, is anyone facing the same issue with Spark but using a different version of OF (not 4.6.1 or 4.6.2)?
I ask because I want to find out whether the issue can be fixed on the client end (i.e. Spark) or whether it still needs some adjustment on the server (OF).