Bitdefender Randomly Drops Clients

Hello,

I came across a strange issue involving random disconnections. We currently have Openfire 4.6.1 installed on a Windows 2019 Server with the clients running Spark 2.9.4 on Windows 10 Pro. There are only 6 users in a small office. It is an AD environment and Openfire is configured to use LDAPS and an embedded database.

The problem is that clients seem to get disconnected at random times. Sometimes 3 clients get disconnected one after the other while the others remain connected. At other times, all 6 can get disconnected within a few seconds of each other. This happens anywhere from every couple of days to every few weeks.

It actually started with previous versions of the server and client, but upgrading to the latest versions did not help and the issue still exists. After a client is forcibly disconnected from the server, Spark is no longer running on that computer and has to be re-started to re-connect. I tried setting the client idle timeout (xmpp.client.idle) as well as disabling stream management (stream.management.active), but those settings did not make any difference.
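For reference, both of those are plain Openfire system properties (editable in the admin console under Server > Server Manager > System Properties, if I remember the path right). The snippet below is just a rough sketch of the equivalent from Openfire plugin code using JiveGlobals; the values are my assumptions for what a "disabled" configuration would look like, not taken from our actual setup:

import org.jivesoftware.util.JiveGlobals;

public class DisconnectTestSettings {

    // Sketch only: set the two properties mentioned above from plugin code.
    public static void apply() {
        // Client idle timeout in milliseconds; -1 should disable idle disconnects.
        JiveGlobals.setProperty("xmpp.client.idle", "-1");

        // Turn off server-side XEP-0198 Stream Management support.
        JiveGlobals.setProperty("stream.management.active", "false");
    }
}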

The client log files show no entries for these disconnects. The server log entries are as follows for each client that is dropped:

2021.01.11 11:06:05 org.jivesoftware.openfire.archive.ArchiveIndexer[MUCSEARCH] - Updating (not creating) an index since 'EPOCH'. This is suspicious, as it suggests that an existing, but empty index is being operated on. If the index is non-empty, index duplication might occur.
2021.01.11 11:11:05 org.jivesoftware.openfire.archive.ArchiveIndexer[MUCSEARCH] - Updating (not creating) an index since 'EPOCH'. This is suspicious, as it suggests that an existing, but empty index is being operated on. If the index is non-empty, index duplication might occur.
2021.01.11 11:16:05 org.jivesoftware.openfire.archive.ArchiveIndexer[MUCSEARCH] - Updating (not creating) an index since 'EPOCH'. This is suspicious, as it suggests that an existing, but empty index is being operated on. If the index is non-empty, index duplication might occur.
2021.01.11 11:17:21 org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x00000004: nio socket, server, /192.168.0.101:51469 => /192.168.0.10:5222)
java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:1.8.0_271]
	at sun.nio.ch.SocketDispatcher.read(Unknown Source) ~[?:1.8.0_271]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) ~[?:1.8.0_271]
	at sun.nio.ch.IOUtil.read(Unknown Source) ~[?:1.8.0_271]
	at sun.nio.ch.SocketChannelImpl.read(Unknown Source) ~[?:1.8.0_271]
	at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:378) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:47) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:519) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$1200(AbstractPollingIoProcessor.java:68) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.process(AbstractPollingIoProcessor.java:1222) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.process(AbstractPollingIoProcessor.java:1211) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:683) ~[mina-core-2.1.3.jar:?]
	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) ~[mina-core-2.1.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_271]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_271]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]

See the attached warn.log file where these strange errors are recorded. I'm not certain what those MUCSEARCH errors mean or whether they even relate to this issue, but we don't use conference (Multi-User Chat) rooms at all.

warn.log (625.1 KB)

I would appreciate any help that anyone can offer. Thank you so much!

UPDATE 2/02/21:
I uninstalled the Monitoring plugin and the messages about MUCSEARCH have vanished, but the random client drops still occur. So I guess one had nothing to do with the other.

I enabled the debug log in hopes of getting more details about those disconnects. Here is a sample block that I extracted from debug.log, showing the messages as 3 client connections are dropped one after another, just seconds apart:

2021.02.02 18:14:48 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event SESSION_IDLE to session 17
Queue : [SESSION_IDLE, ]

2021.02.02 18:14:48 org.jivesoftware.openfire.nio.ClientConnectionHandler - ConnectionHandler: Pinging connection that has been idle: org.jivesoftware.openfire.nio.NIOConnection@78d2c2e3 MINA Session: (0x00000011: nio socket, server, /192.168.0.102:62866 => /192.168.0.10:5222)
2021.02.02 18:14:48 org.apache.mina.filter.ssl.SslFilter - Session Server[17](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=137 cap=4096: 3C 69 71 20 74 79 70 65 3D 22 67 65 74 22 20 69...]
2021.02.02 18:14:48 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event MESSAGE_SENT to session 17
Queue : [MESSAGE_SENT, ]

2021.02.02 18:14:48 org.apache.mina.filter.ssl.SslFilter - Session Server[17](SSL): Message received : HeapBuffer[pos=0 lim=133 cap=256: 17 03 03 00 80 55 7A FF 92 88 93 0E 7D 8D AF 95...]
2021.02.02 18:14:48 org.apache.mina.filter.ssl.SslHandler - Session Server[17](SSL) Processing the received message
2021.02.02 18:14:48 org.apache.mina.filter.ssl.SslFilter - Session Server[17](SSL): Processing the SSL Data 
2021.02.02 18:14:48 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event MESSAGE_RECEIVED to session 17
Queue : [MESSAGE_RECEIVED, ]

2021.02.02 18:14:48 org.apache.mina.filter.codec.ProtocolCodecFilter - Processing a MESSAGE_RECEIVED for session 17
2021.02.02 18:14:50 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event EXCEPTION_CAUGHT to session 17
Queue : [EXCEPTION_CAUGHT, ]

2021.02.02 18:14:50 org.apache.mina.filter.ssl.SslHandler - Unexpected exception from SSLEngine.closeInbound().
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.SSLEngineImpl.closeInbound(Unknown Source) ~[?:1.8.0_271]
	at org.apache.mina.filter.ssl.SslHandler.destroy(SslHandler.java:209) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.ssl.SslFilter.sessionClosed(SslFilter.java:485) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$900(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.sessionClosed(DefaultIoFilterChain.java:1092) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.sessionClosed(IoFilterAdapter.java:98) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireSessionClosed(DefaultIoFilterChain.java:599) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.service.IoServiceListenerSupport.fireSessionDestroyed(IoServiceListenerSupport.java:251) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeNow(AbstractPollingIoProcessor.java:1142) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeSessions(AbstractPollingIoProcessor.java:864) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:694) [mina-core-2.1.3.jar:?]
	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) [mina-core-2.1.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_271]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_271]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
2021.02.02 18:14:50 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event SESSION_CLOSED to session 17
Queue : [SESSION_CLOSED, ]

2021.02.02 18:14:50 org.jivesoftware.openfire.SessionManager - Closing session with address user1@ourdomain.com/Spark 2.9.4 and streamID 9nwwokfj9q does not have SM enabled.
2021.02.02 18:14:50 org.jivesoftware.openfire.spi.RoutingTableImpl - Removing client route user1@ourdomain.com/Spark 2.9.4


	::
	::
	::
	::


2021.02.02 18:15:06 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event EXCEPTION_CAUGHT to session 18
Queue : [EXCEPTION_CAUGHT, ]

2021.02.02 18:15:06 org.apache.mina.filter.ssl.SslHandler - Unexpected exception from SSLEngine.closeInbound().
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.SSLEngineImpl.closeInbound(Unknown Source) ~[?:1.8.0_271]
	at org.apache.mina.filter.ssl.SslHandler.destroy(SslHandler.java:209) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.ssl.SslFilter.sessionClosed(SslFilter.java:485) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$900(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.sessionClosed(DefaultIoFilterChain.java:1092) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.sessionClosed(IoFilterAdapter.java:98) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireSessionClosed(DefaultIoFilterChain.java:599) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.service.IoServiceListenerSupport.fireSessionDestroyed(IoServiceListenerSupport.java:251) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeNow(AbstractPollingIoProcessor.java:1142) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeSessions(AbstractPollingIoProcessor.java:864) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:694) [mina-core-2.1.3.jar:?]
	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) [mina-core-2.1.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_271]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_271]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
2021.02.02 18:15:06 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event SESSION_CLOSED to session 18
Queue : [SESSION_CLOSED, ]

2021.02.02 18:15:06 org.jivesoftware.openfire.SessionManager - Closing session with address user2@ourdomain.com/Spark 2.9.4 and streamID 4gbu9b43tf does not have SM enabled.
2021.02.02 18:15:06 org.jivesoftware.openfire.spi.RoutingTableImpl - Removing client route user2@ourdomain.com/Spark 2.9.4


	::
	::
	::
	::


2021.02.02 18:15:55 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event EXCEPTION_CAUGHT to session 19
Queue : [EXCEPTION_CAUGHT, ]

2021.02.02 18:15:55 org.apache.mina.filter.ssl.SslHandler - Unexpected exception from SSLEngine.closeInbound().
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:1.8.0_271]
	at sun.security.ssl.SSLEngineImpl.closeInbound(Unknown Source) ~[?:1.8.0_271]
	at org.apache.mina.filter.ssl.SslHandler.destroy(SslHandler.java:209) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.ssl.SslFilter.sessionClosed(SslFilter.java:485) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$900(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.sessionClosed(DefaultIoFilterChain.java:1092) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.sessionClosed(IoFilterAdapter.java:98) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextSessionClosed(DefaultIoFilterChain.java:606) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireSessionClosed(DefaultIoFilterChain.java:599) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.service.IoServiceListenerSupport.fireSessionDestroyed(IoServiceListenerSupport.java:251) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeNow(AbstractPollingIoProcessor.java:1142) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.removeSessions(AbstractPollingIoProcessor.java:864) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:694) [mina-core-2.1.3.jar:?]
	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) [mina-core-2.1.3.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_271]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_271]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
2021.02.02 18:15:55 org.apache.mina.filter.executor.OrderedThreadPoolExecutor - Adding event SESSION_CLOSED to session 19
Queue : [SESSION_CLOSED, ]

2021.02.02 18:15:55 org.jivesoftware.openfire.SessionManager - Closing session with address user3@ourdomain.com/Spark 2.9.4 and streamID 1uxxh5fk8b does not have SM enabled.
2021.02.02 18:15:55 org.jivesoftware.openfire.spi.RoutingTableImpl - Removing client route user3@ourdomain.com/Spark 2.9.4

debug.log (10.7 KB)
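One note on the excerpt above: the 137-byte write to session 17 right after "Pinging connection that has been idle" is presumably Openfire's XEP-0199 keep-alive ping to the idle client (the buffer starts with <iq type="get" i...). On the client side, Smack answers these automatically, but it can also be told to ping the server itself and report when the connection has silently died, which might help show whether the connection is already gone before Spark disappears. A rough Smack 4.x sketch (my own illustration, not Spark code; the interval and listener are assumptions):

import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smackx.ping.PingFailedListener;
import org.jivesoftware.smackx.ping.PingManager;

public class KeepAliveWatcher {

    // Sketch: have the client send XEP-0199 pings and log when they start failing.
    public static void watch(XMPPConnection connection) {
        PingManager pingManager = PingManager.getInstanceFor(connection);

        // Ping the server every 60 seconds (illustrative value).
        pingManager.setPingInterval(60);

        // Called when a ping gets no answer, i.e. the connection is likely dead.
        pingManager.registerPingFailedListener(new PingFailedListener() {
            @Override
            public void pingFailed() {
                System.out.println("XMPP ping failed - connection to the server seems dead");
            }
        });
    }
}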

Anyone have any ideas how to proceed from here? Thank you!

UPDATE 3/02/21:
I’ve been monitoring both sites and so far it seems that the random drops may be caused by the enterprise endpoint AV software. This is surprising since we’ve been using this software (Bitdefender GravityZone Business Security) for several years along with Spark/OF without issues.

I tracked it down to its "Advanced Anti-Exploit" module, which seems to be enabled by default and kills running processes. To test this theory, I disabled the module at one site and left it enabled at the other. Every 3 or 4 days, the drops occur at the site where it's enabled. The other site has no drops at all.

I believe this module was introduced by an automatic update about a year ago without my knowledge, which is also around the time the drops started to occur. Anyway, I will continue to monitor for another 2 weeks or so just to verify my suspicions and report back. If it turns out that this was the culprit, I will correct the title of this post accordingly and ask the mods to move it to the Spark forum.

Thanks to Speedy for getting me to take another closer look at the AV software!

UPDATE 3/15/21:
So after much testing, I concluded that it was indeed Bitdefender's newly introduced "Advanced Anti-Exploit" module that was causing all the client disconnections. Merely disabling the module had no effect - I had to actually remove the module from every workstation. The very presence of that module on a workstation, even when disabled, would still cause the drops.

Thank you everyone for your input and efforts, especially Guus, Wroot and Speedy!

Mods: Please move this post to the Spark Support forum. It turns out that Openfire had nothing to do with this issue after all. I renamed the title accordingly.

Hi! I also encountered this problem, but after upgrading to Openfire 4.5.4, the problem disappeared.

Thanks for your reply, but we're already running the latest Openfire 4.6.1 with stream management disabled and continue to have the same problems. I don't think we had the problem with version 4.5.0; it only started after upgrading to 4.5.1 and continued with 4.5.4 and 4.6.1.

What's really strange is that it doesn't simply disconnect the clients - each client's Spark process in memory is terminated and has to be restarted so it can reconnect to the Openfire server.

If it continues or gets worse, I may end up doing a fresh installation of 4.6.1 and manually entering all my settings. Of course, I was trying to put this off as long as possible.

Guus, if you’re also reading this… any ideas what may be going on???

Hey Michael. I wonder if the problem is in Spark, and not so much Openfire. This appears to be Openfire reacting to a 'close' that is initiated by the client. Do you have the same problem with clients other than Spark?

Spark also keeps some logs (in the user directory). Do those contain any hints?

The MUCSEARCH related warnings are very unlikely to be related. You can ignore those for now.

We are also having the issue with Openfire 4.6.1 and Spark 2.9.4.
Client Issues:

  1. Randomly get disconnected and their client force closes
  2. Their client is no longer connected and they just do not receive messages from other users until the user on the receiving end closes and reopens Spark
  3. Some of the clients which lost connection can no longer click on File >> Preferences; the client just hangs.

Errors in logs:
(I removed XMPP Domain Name, Room IDs and Users from the log entries)

ConnectionHandler

2021.02.03 09:35:18 WARN [socket_c2s-thread-16]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x000001FF: nio socket, server, null => 0.0.0.0/0.0.0.0:5222)

LdapVCardProvider
Since the upgrade to 4.6.1, we have been receiving LdapVCardProvider errors as the system tries to look up conference rooms (MUC) in LDAP:

2021.02.03 10:18:51 org.jivesoftware.openfire.ldap.LdapVCardProvider - Username <Room ID>@conference.<XMPP Domain Name> not found
org.jivesoftware.openfire.user.UserNotFoundException: Username <Room ID>@conference.<XMPP Domain Name> not found
at org.jivesoftware.openfire.ldap.LdapManager.findUserRDN(LdapManager.java:1132) ~[xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.ldap.LdapManager.findUserRDN(LdapManager.java:1049) ~[xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.ldap.LdapVCardProvider.getLdapAttributes(LdapVCardProvider.java:184) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.ldap.LdapVCardProvider.loadVCard(LdapVCardProvider.java:238) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.vcard.VCardManager.getOrLoadVCard(VCardManager.java:245) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.vcard.VCardManager.getVCard(VCardManager.java:238) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.muc.spi.IQMUCvCardHandler.handleIQ(IQMUCvCardHandler.java:179) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.process(MultiUserChatServiceImpl.java:467) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:370) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:350) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:118) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:380) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:95) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:322) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.1.jar:4.6.1]
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.1.jar:4.6.1]

DefaultPubSubPersistenceProvider (PubSub Service Enabled is deselected)

2021.02.03 09:58:41 org.jivesoftware.openfire.pubsub.DefaultPubSubPersistenceProvider - Subscription found for a non-existent affiliate: <user1>@<XMPP Domain Name> in node: UniqueIdentifier{serviceId='<user2>@<XMPP Domain Name>', nodeId='<user2>@<XMPP Domain Name>'}
2021.02.03 09:58:41 org.jivesoftware.openfire.pubsub.DefaultPubSubPersistenceProvider - Subscription found for a non-existent affiliate: <user3>@<XMPP Domain Name> in node: UniqueIdentifier{serviceId='<user4>@<XMPP Domain Name>', nodeId='<user4>@<XMPP Domain Name>'}

We have checked the client logs of the users who randomly have the issue and found nothing from the time of the crash.
We were hoping that an upgrade to 2.9.5 or 3.0.0, possibly with a newer version of Smack, would fix the issue. Any idea how those updates are going? We are anxiously watching https://igniterealtime.atlassian.net.

Hi Guus, thanks for your response. I will check the clients' Spark logs and get back to you. The reason I assumed it was Openfire causing the client drops is that multiple clients get dropped at almost the same time.

There are only 6 users at this site. The number of clients getting disconnected varies. Sometimes all clients get dropped while at other times maybe only 2 or 3 get dropped. In all cases, the disconnections occur within seconds of each other.

Is there a way to enable debug logging on the Spark client? If so, I’ll enable it on all the clients.

That is curious. Maybe I was wrong, and it is Openfire? Maybe some kind of network hiccup that a newer version of Spark is more susceptible to?

I don’t think there is.

I also wanted to emphasize that when I notice a client gets disconnected and I check the client computer - it is no longer running Spark, as if Spark has either crashed or exited. I then have to manually start Spark on that computer so it can reconnect to the Openfire server.

I’ll check the client’s Windows event logs to see if it shows anything interesting.
Will also check the antivirus package in case it’s involved in some way.

In the meantime, I selected "Do not disconnect clients that are idle." under the Idle Connections Policy to see if that makes any difference.

Our environment:

Openfire 4.6.1 runs on Windows Server 2019 (Dell PowerEdge)
Spark 2.9.4 runs on Windows 10 Professional (Dell Optiplex)

Thanks again for your help!

–Michael

Just as an FYI, the setting "Do not disconnect clients that are idle." was the first thing we tried and it had no effect on this issue.

Thanks for the heads up. What’s really strange is that Spark is no longer running on those machines that lose their connection to the server. It’s one thing to simply lose the connection, but quite another to crash or exit Spark.

This also indicates, like Guus said, that the problem resides on the clients. But then how is it that multiple clients exit (or crash?) Spark at once or within seconds of each other?

Have you tried using other versions of Spark or different clients altogether?

We did try downloading the nightly build of 2.9.5, but it was not stable enough. No, we have not downgraded the clients, as it seems others have similar issues with older versions. Also, we use Client Control, which requires us to use 2.9.4 in order to control the users' plugins correctly.
The errors do seem to be stream management related, which is the piece that was disabled in Spark 2.9.0+ because it caused "issues". What I found while tracking this down is that the issue lies with the Spark client, not the server.
We found older forum posts in which many users complained about intermittent connection issues with Openfire and Spark dating back to 2.8.3. In those posts, Wroot and some other developers tracked the issue down to a conflict between Spark 2.8.3 and Stream Management. The developers disabled Stream Management in Spark 2.9.0+ (Aug 5, 2020), hoping that would temporarily correct the issue. I found that all of the connections in Openfire have Stream Management disabled (shown in the Openfire admin console under Sessions >> Client Sessions).
Issue SPARK-2140, "Add support for XEP-0198: Stream Management", was created and explained as follows:

"XEP-0198 'Stream Management' is a feature that allows a client to 'survive' a network hiccup, without having to fully redo the authentication cycle."

As we have seen, closing and reopening Spark was the only fix to "redo the authentication cycle".
The release of SPARK-2140 is planned for Spark 3.0.0, but that has been delayed.

Yes, I believe we had this issue with some previous versions of the Spark client as well. That was also one of the reasons we upgraded to the latest version, 2.9.4, but the problem still exists. We also use the Client Control plugin (I contributed to some of its development, but cannot correlate anything in that code with the disconnections).

I disabled Stream Management and that doesn't seem to make any difference; clients get dropped whether it's on or off. Even if it is a stream management issue, would that make the client exit or crash so that it's no longer running in memory? I'm not sure why that happens.

Sometimes the problem happens once in several weeks, while other times it can happen within a couple of days.

We've been running Openfire/Spark for almost a decade and have only recently been dealing with this nasty issue. I'm not sure exactly at what point it all began. I hope they're able to figure it out. In the meantime, I'll check the event logs on each disconnected client.

By the way, when you installed Spark 2.9.4, did you install it as an upgrade to a previous version or did you try doing a fresh install? I wonder if a fresh install would make any difference.

We have tried uninstalling, cleaning up directories and reinstalling, and it "worked" for a while, but we have users' desktops which had never had the client before, with a fresh Spark install, and the disconnects happen all the time for them.
Stream Management is essentially "keep-alive" functionality meant to let connections survive blips in the network. Stream Management was disabled programmatically in Spark 2.9.0, so there is no way to change it in the client. Disabling it in Openfire will not really do anything unless you use another client which still has Stream Management enabled, in which case I would not disable it in Openfire.
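For anyone wondering what "disabled programmatically" amounts to: as far as I can tell, it comes down to a Smack-level switch flipped before connecting. The sketch below is not the actual Spark code, just an illustration against the Smack 4.x API; the domain, host and credentials are placeholders:

import org.jivesoftware.smack.tcp.XMPPTCPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;

public class StreamManagementSwitch {

    public static XMPPTCPConnection connect() throws Exception {
        // Disable XEP-0198 Stream Management for all XMPPTCPConnections created
        // from here on (this is the kind of switch Spark 2.9.0+ flips).
        XMPPTCPConnection.setUseStreamManagementDefault(false);

        // Placeholder connection settings, not a real environment.
        XMPPTCPConnectionConfiguration config = XMPPTCPConnectionConfiguration.builder()
                .setXmppDomain("ourdomain.com")
                .setHost("192.168.0.10")
                .setPort(5222)
                .build();

        XMPPTCPConnection connection = new XMPPTCPConnection(config);
        // It can also be toggled per connection before connecting.
        connection.setUseStreamManagement(false);

        connection.connect().login("user1", "secret");

        // After login this should report false, matching what the Openfire admin
        // console shows under Sessions >> Client Sessions.
        System.out.println("Stream Management enabled: " + connection.isSmEnabled());
        return connection;
    }
}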

Do you believe that the problem can be resolved if they disable stream management on the Spark client? Also, have you tried using an entirely different XMPP client instead of Spark?

I checked all the clients' Windows event logs and found absolutely nothing out of the ordinary on those clients at the time they got dropped. I also checked their Spark logs and nothing was recorded at that time.

To be clear, the version tag doesn't mean it is actually planned. Nothing is really planned with Spark, as we usually have nobody working on it consistently. Contributors come and go. I mark some tickets with future version tags so as not to lose sight of them; that way they are grouped and I can get the list by just opening a tag.

Also, I don't remember there being disconnection problems because of SM in Spark 2.9.x. There was a problem where, once Spark loses its network connection, a limbo session is left on the server. That's why SM was disabled in Spark: it doesn't know how to properly reconnect to an SM session, so its reconnection logic would first have to be tuned for that.

Stream Management is disabled and cannot be enabled in Spark 2.9.0+.

This forum post hints at Stream Management being an issue with Spark.
In SPARK-2140 you and Guus mention disabling Stream Management for now on 05-AUG-2020.

So "Disable Smacks support for Stream Management (for now) by guusdk · Pull Request #502 · igniterealtime/Spark · GitHub" disables SM for now, until the reconnection logic is adjusted?

As explained in XEP-0198: Stream Management

“This specification defines an XMPP protocol extension for active management of an XML stream between two XMPP entities, including features for stanza acknowledgements and stream resumption…is a feature that allows a client to ‘survive’ a network hiccup, without having to fully redo the authentication cycle.”

It is the stream resumption that seems to be at fault here. We have clients that suddenly cannot receive messages until they exit and reopen their Spark clients, which is a "redo of the authentication cycle" (see the sketch below for how the two cases can be told apart).
As for the version tags, they were giving me hope that someone was actually making progress on the client. Dates were being set and met up to version 2.9.4, and now we're just anxiously waiting.
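To make the resumption vs. full re-authentication distinction concrete, here is a rough Smack 4.x sketch (my own illustration, not Spark code) of how a client can tell which of the two happened after a reconnect. The 'resumed' flag on the authenticated callback corresponds to XEP-0198 stream resumption; everything else is assumed for illustration:

import org.jivesoftware.smack.ConnectionListener;
import org.jivesoftware.smack.ReconnectionManager;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnection;

public class ResumptionObserver {

    public static void watch(XMPPTCPConnection connection) {
        // Let Smack try to re-establish the connection after a network blip.
        ReconnectionManager.getInstanceFor(connection).enableAutomaticReconnection();

        connection.addConnectionListener(new ConnectionListener() {
            @Override
            public void authenticated(XMPPConnection conn, boolean resumed) {
                // resumed == true  -> the XEP-0198 stream was resumed, no re-login needed
                // resumed == false -> the full authentication cycle was redone
                System.out.println(resumed
                        ? "Stream was resumed"
                        : "Full re-authentication cycle was performed");
            }

            @Override
            public void connectionClosedOnError(Exception e) {
                System.out.println("Connection dropped: " + e);
            }
        });
    }
}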

Just curious, what is your environment and how often do the disconnections occur? Are your users actively using their computers at the time of disconnection? Or do you notice the disconnections while a workstation is in a “locked” state for several hours or days?

We have Openfire 4.6.1 (Red Hat Linux) and Spark 2.9.4 (Windows 10). No, the times are random. The users have reported that they were actively using their machines at the time.

Thanks for your reply. I was trying to compare notes to see if I could find anything in common between our environments. I recently updated to Openfire 4.6.2 at both sites, which run on Windows Server 2019. Our clients are also on Spark 2.9.4. Sometimes the drops don't occur for a couple of weeks; other times, they occur within 2 or 3 days of each other.

Another thing that bothers me is that if this was a common issue, why aren’t more people complaining about it? Why are we the only ones with this problem?

For now, I’ll leave it alone and maybe try to do a fresh install from scratch in a few months when I have more downtime. I’ll keep you posted.