XMPPTCPConnection: internal race condition during stream resume?

Flow · March 4, 2019, 2:16pm

Thanks, this means that

LOGGER.log(Level.FINE, "Not every reader/writer threads where terminated on connection re-initializtion of {0}. Available permits {1}", logObjects);

should be logged. You probably want to print FINE logs for XMPPTCPConnection in the adb log.

Indeed. There is a stray Smack Writer thread tid=27 which is probably the reason why the semaphore was not incremented yet.

"Smack Writer (0)" daemon prio=5 tid=27 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13c41b60 self=0x77bb146e00
  | sysTid=30938 nice=0 cgrp=default sched=0/0 handle=0x778c0d34f0
  | state=S schedstat=( 6130306 8063154 9 ) utm=0 stm=0 core=2 HZ=100
  | stack=0x778bfd1000-0x778bfd3000 stackSize=1037KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x029530d5> (a java.lang.Object)
  at java.lang.Thread.parkFor$(Thread.java:2135)
  - locked <0x029530d5> (a java.lang.Object)
  at sun.misc.Unsafe.park(Unsafe.java:358)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:190)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2059)
  at org.jivesoftware.smack.util.ArrayBlockingQueueWithShutdown.take(ArrayBlockingQueueWithShutdown.java:303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.nextStreamElement(XMPPTCPConnection.java:1431)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1449)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.access$3400(XMPPTCPConnection.java:1303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter$1.run(XMPPTCPConnection.java:1352)
  at java.lang.Thread.run(Thread.java:764)

The reader and writer reads are to increment the semaphore if the exit. As of now I couldn’t come up with a hypothesis why the writer thread did not exit. There is a sentinel which checks if the writer thread exited on shutdown

LOGGER.log(Level.WARNING, "shutdownDone was not marked as successful by the writer thread", e);

does that appear in your logs?

Georg_Lukas · March 4, 2019, 3:17pm

Unfortunately my logs didn’t go that far. I’ll keep an eye open for this as well.

Yeah, but I still didn’t have taken the time to dig into JUL and to understand what exactly I need to paste into where to make this work. If you have a simple example I can steal, please let me know.

Georg_Lukas · March 4, 2019, 3:45pm

I’ve just run into the same situation, and there is nothing in the log since before the old connection failed. Furthermore, it looks like there are two PacketWriters in existence, both waiting on their respective queue.take() (different queue objects), and neither queue is isShutdown.

Looking back into my own code, I only call c.instantShutdown() if it reports isConnected()==true, which is obviously not the case. I’m not sure which precondition would be a better match to run instantShutdown on, or if I can just hammer it anyway?

Flow · March 4, 2019, 4:02pm

That. If you do the instantShutdown() conditionally, that would also explain why you experiencing it whereas I can not remember ever running into it.

I also have a root cause hypothesis: It all comes down to state destruction on error/exceptions. Something which I considered at the very beginning when desiging the new state machine connection types, but which was never was done properly in the existing (legacy) connection types. Assume for example that this sync point throws (or the sync point right below), which is not unlikely in slow/unreliable mobile connections, then an following call to connect(), if there was no instantShutdown() or shutdown() in between, will possibly stall due the reader and writer threads not being terminated. I will see what can be done about it.

Flow · March 4, 2019, 4:11pm

Something like this

Georg_Lukas · March 4, 2019, 4:46pm

Except this will destroy the 0198 state during a reconnect-gone-wrong. You probably need to expose instantShutdown() one layer up and call that instead.

Flow · March 4, 2019, 5:04pm

Right. For now it would be great if you would invoke instantShutdown() unconditionally and report back your findings.

Georg_Lukas · March 4, 2019, 8:28pm

Will do so. Thanks so far.

Georg_Lukas · March 16, 2019, 1:17am

I’ve deployed the latest smack 4.3-fixes branch and am calling instantShutdown unconditionally before connecting. I still have a stall in the semaphore, traces attached: traces-yaxim-bruno.txt (39.0 KB)

No, I haven’t dug into JUL yet, so no useful logs for you, sorry.

Flow · March 16, 2019, 9:45am

Thanks for reporting

This appears to be caused by a synchronized (which became superfluous due the latest changes).

Could you try

github.com/Flowdalic/Smack

Remove 'synchronized' from notifyConnectionError()

committed 09:42AM - 16 Mar 19 UTC

Flowdalic

+1 -1

as it is not neccessary and causes stalls and deadlocks with a concurrent connec…t()/login() (which could be resolved interrupting the connect()/login() calling thread): "Smack Reader (0)" daemon prio=5 tid=21 Blocked | group="main" sCount=1 dsCount=0 obj=0x12c84670 self=0x7e072ca200 | sysTid=14965 nice=0 cgrp=default sched=0/0 handle=0x7e06155450 | state=S schedstat=( 63430626 3034269 21 ) utm=5 stm=0 core=2 HZ=100 | stack=0x7e06053000-0x7e06055000 stackSize=1037KB | held mutexes= at org.jivesoftware.smack.tcp.XMPPTCPConnection.notifyConnectionError(XMPPTCPConnection.java:-1) - waiting to lock <0x0ec6e6bb> (a org.jivesoftware.smack.tcp.XMPPTCPConnection) held by thread 22 at org.jivesoftware.smack.tcp.XMPPTCPConnection.access$3900(XMPPTCPConnection.java:154) at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1330) at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$900(XMPPTCPConnection.java:1064) at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1081) at java.lang.Thread.run(Thread.java:761) "connect" prio=5 tid=22 Waiting | group="main" sCount=1 dsCount=0 obj=0x12c8e820 self=0x7e19bcd600 | sysTid=15047 nice=0 cgrp=default sched=0/0 handle=0x7e05ee4450 | state=S schedstat=( 14067083 7475835 14 ) utm=1 stm=0 core=0 HZ=100 | stack=0x7e05de2000-0x7e05de4000 stackSize=1037KB | held mutexes= at java.lang.Object.wait!(Native method) - waiting on <0x0c058b14> (a java.lang.Object) at java.lang.Thread.parkFor$(Thread.java:2127) - locked <0x0c058b14> (a java.lang.Object) at sun.misc.Unsafe.park(Unsafe.java:325) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:161) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:840) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.Semaphore.acquire(Semaphore.java:446) at org.jivesoftware.smack.tcp.XMPPTCPConnection.initConnection(XMPPTCPConnection.java:685) at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:942) at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:417) - locked <0x0ec6e6bb> (a org.jivesoftware.smack.tcp.XMPPTCPConnection) at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:717) - locked <0x0ec6e6bb> (a org.jivesoftware.smack.tcp.XMPPTCPConnection) at org.yaxim.androidclient.service.SmackableImp.doConnect(SmackableImp.java:304) at org.yaxim.androidclient.service.SmackableImp$5.run(SmackableImp.java:391)

Georg_Lukas · March 19, 2019, 8:15am

I’ve tried your latest 4.3 branch (https://github.com/Flowdalic/Smack/commit/7059b60672628d1427776a794d396fb09b2ed42b) and it ended up blocking in mXMPPConnection.instantShutdown(). I didn’t have a chance to make an anr snapshot of it though, sorry.

Georg_Lukas · March 19, 2019, 11:55am

Oh, I’m sorry. I’ve just fetched the anr log and realized that I’m still on 4.3.3-SNAPSHOT and not on 4.3.4-SNAPSHOT. Please ignore my last message from today.

Georg_Lukas · March 21, 2019, 11:38am

I’ve re-evaluated with the correct code now. It still hangs. Apparently, the race condition is triggered by a timeout during login (not connect), and after login() fails with a NoResponseException:

Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for successful SASL authentication
   at org.jivesoftware.smack.SASLAuthentication.authenticate(SASLAuthentication.java:224)
   at org.jivesoftware.smack.tcp.XMPPTCPConnection.loginInternal(XMPPTCPConnection.java:402)
   at org.jivesoftware.smack.AbstractXMPPConnection.login(AbstractXMPPConnection.java:528)
   at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:724)
   ... 2 more

…the PacketReader is still awaiting data from the socket, which it eventually receives after my call to instantShutdown() completes - so it looks like instantShutdown() is not properly tearing down the socket or something else is failing. Because of that, connectInternal() keeps hanging on the readerWriterSemaphore and I’m deadlocked.

traces-yaxim-2019-03-21.txt (74.5 KB)

I would provide Smack logs, but I just sunk an hour into attempting to beat Android JUL into dumping the Smack logs in one way or another, and failed.

Flow · March 21, 2019, 11:56am

That is strange. I would expect the reader/writer threads to have terminated by the time instantShutdown() (or shutdown() FWIW) returns, especially with the latest changes. Could you verify that you see the following code block at the very end of shutdown:

    // Wait for reader and writer threads to be terminated.
    readerWriterSemaphore.acquireUninterruptibly(2);
    readerWriterSemaphore.release(2);

in the Smack version you are using?

Georg_Lukas · March 21, 2019, 12:28pm

Yes, it is there. I have no idea how it got around that.

Georg_Lukas · March 21, 2019, 1:14pm

Not sure what I did, but all the cargo-culted junk in my project is working now. Next time it deadlocks when I’m near to a computer, I’ll be able to provide FINEST logs.

Georg_Lukas · March 25, 2019, 11:25am

Here we are again, with logs! 2019-03-24-deadlock-log.txt (50.4 KB)

What happens in the log:

18:25:41 - yaxim loses network connectivity
18:26:21 - a TCP connection is re-established and reader and writer are initialized on it
18:26:51 - connection times out while waiting for TLS, only the reader is closed, but not the writer
a bunch of re-connection attempts without success happen
18:28:57 - we connect again, get the evil error logged:

03-24 18:28:57.841 10056 29748 D XMPPTCPConnection: Not every reader/writer threads
      where terminated on connection re-initializtion of {0}. Available permits {1}

18:30:03 the blocked thread gets interrupted, a new connection attempt is made and locks up in the same place
…ad infinitum…

From the 2019-03-24-traces.txt file it looks like we end up with two Smack Writers, but those are on different connections, so maybe this is a consequence of changing the yaxim preferences. Anyway, the writer waiting on the queue is not something normal, and the force-instant-shutdown! log indicates that I’m doing my duty there. My connecting code:

try {
	debugLog("connectAndLogin: force-instant-shutdown!");
	mXMPPConnection.instantShutdown(); // blocking shutdown prior to re-connection
} catch (Exception e) {
	debugLog("conn.shutdown() failed, ignoring: " + e);
}
debugLog("connectAndLogin: entering synchronized mXMPPConnection");
synchronized (mXMPPConnection) {
	debugLog("connectAndLogin: within synchronized mXMPPConnection");
	mXMPPConnection.connect();
	if (create_account) { /** snipped */ }
	mXMPPConnection.login(mConfig.userName, mConfig.password,
			Resourcepart.from(mConfig.ressource));
}
debugLog("connectAndLogin: left synchronized mXMPPConnection");

Flow · March 25, 2019, 12:32pm

It looks like the disconnectedButResumable flag is set to true, since we don’t see the reader/writer threads being shutdown log messages. I wonder if the shutdown() method should be changed so that the threads and possibly the socket, are always shut down.

I am still not sure why we see a writer thread and not an accompanying reader thread. They should only exist in pairs, but that is obviously not the case. Hence as first action I have commited

which hopefully helps to pin down the exact situation where only the reader thread exist but not the write thread.

I will also stare at the fine debug logs you provided and muse about them in deep concentration when I find the required quite time. Maybe that will also bring light into the darkness.

Georg_Lukas · March 28, 2019, 10:53am

XMPPTCPConnection log when reader/writer threads start and exit · Flowdalic/Smack@35f0aa9 · GitHub

I go along and raise with XXX: even moar readerWriterSemaphore logging · ge0rg/smack4@ceb83aa · GitHub

The respective code created the following log, which shows that the Smack Writer isn’t terminating and that connect() hangs eternally in the semaphore: 2019-03-28-smack-shutdown.txt (24.7 KB)

Relevant events:

09:00:02 new re-connection attempt (after being connected successfully before), reader and writer get created
09:00:40 we send a that never gets answered
09:01:51 the yaxim connection timeout handler triggers a reconnect
09:01:53 Smack Reader terminates in anticipation of the connection breakage (not sure why this happens first)
09:01:54 login() is aborted with an InterruptedException
09:01:54 new connection and login attempt is started, hangs in Not every reader/writer threads...

The most relevant part is this:

03-28 09:01:54.175 D/yaxim.SmackableImp( 2489): connectAndLogin: force-instant-shutdown!
03-28 09:01:54.175 D/yaxim.SmackableImp( 2489): connectAndLogin: entering synchronized mXMPPConnection
03-28 09:01:54.175 D/yaxim.SmackableImp( 2489): connectAndLogin: within synchronized mXMPPConnection
03-28 09:01:54.178 D/XMPPTCPConnection( 2489): Trying to establish TCP connection to /212.21.75.16:5222

I call instantShutdown() which does nothing… because… disconnectedButResumeable must apparently be true from the previous connection. That variable is only set to false in afterSuccessfulLogin() which is apparently a bit misnamed because it is only called after stream resumption completed or failed, and not after authentication. Anyway, because this only gets reset after a successful session setup, an instant disconnect during connection/login/resumption is bound to fail and deadlock.

We should probably change the pre-condition check in shutdown(instant=true) in some appropriate way. Ideas?

Flow · March 28, 2019, 2:22pm

There is a afterSuccessfulLogin(false) in L479 (at the end of XMPPTCPConnection.loginInternal()). So I’d argue it is not misnamed.

My first idea how this potentially could be solved is

As always, if you have the ability to test the change and report your findings back, then this is highly appreciated. But please beware that this is untested code.