
XMPPTCPConnection: internal race condition during stream resume?


#21

Thanks, this means that

LOGGER.log(Level.FINE, "Not every reader/writer threads where terminated on connection re-initializtion of {0}. Available permits {1}", logObjects);

should be logged. You probably want to print FINE logs for XMPPTCPConnection in the adb log.

Indeed. There is a stray Smack Writer thread (tid=27), which is probably the reason why the semaphore was not incremented yet.

"Smack Writer (0)" daemon prio=5 tid=27 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13c41b60 self=0x77bb146e00
  | sysTid=30938 nice=0 cgrp=default sched=0/0 handle=0x778c0d34f0
  | state=S schedstat=( 6130306 8063154 9 ) utm=0 stm=0 core=2 HZ=100
  | stack=0x778bfd1000-0x778bfd3000 stackSize=1037KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x029530d5> (a java.lang.Object)
  at java.lang.Thread.parkFor$(Thread.java:2135)
  - locked <0x029530d5> (a java.lang.Object)
  at sun.misc.Unsafe.park(Unsafe.java:358)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:190)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2059)
  at org.jivesoftware.smack.util.ArrayBlockingQueueWithShutdown.take(ArrayBlockingQueueWithShutdown.java:303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.nextStreamElement(XMPPTCPConnection.java:1431)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1449)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.access$3400(XMPPTCPConnection.java:1303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter$1.run(XMPPTCPConnection.java:1352)
  at java.lang.Thread.run(Thread.java:764)

The reader and writer threads are supposed to increment the semaphore when they exit. So far I have not been able to come up with a hypothesis for why the writer thread did not exit. There is a sentinel which checks whether the writer thread exited on shutdown:

LOGGER.log(Level.WARNING, "shutdownDone was not marked as successful by the writer thread", e);

Does that appear in your logs?
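
To make the handshake described above concrete, here is a simplified sketch of the pattern: each reader/writer thread holds one of two permits while it runs and gives it back when it exits, and shutdown waits until both permits are available again. The semaphore name mirrors Smack’s readerWriterSemaphore, but the surrounding code is illustrative, not the actual implementation.

    import java.util.concurrent.Semaphore;

    // Simplified sketch of the reader/writer termination handshake described above.
    // The semaphore name mirrors Smack's readerWriterSemaphore; the rest is illustrative.
    public class ReaderWriterHandshake {

        // Two permits: one for the reader thread, one for the writer thread.
        private final Semaphore readerWriterSemaphore = new Semaphore(2);

        public void init(Runnable readerLoop, Runnable writerLoop) {
            // If a previous reader/writer is still alive, not all permits are available
            // (this is what the FINE log about "Available permits" reports).
            if (readerWriterSemaphore.availablePermits() != 2) {
                System.err.println("Not every reader/writer thread terminated, permits: "
                        + readerWriterSemaphore.availablePermits());
            }
            startThread("Smack Reader", readerLoop);
            startThread("Smack Writer", writerLoop);
        }

        public void shutdown() {
            // The real shutdown also signals the loops to stop (queue shutdown, socket
            // close); here we only wait until both threads have given their permit back.
            readerWriterSemaphore.acquireUninterruptibly(2);
            readerWriterSemaphore.release(2);
        }

        private void startThread(String name, Runnable loop) {
            readerWriterSemaphore.acquireUninterruptibly(); // hold a permit while running
            Thread t = new Thread(() -> {
                try {
                    loop.run();
                } finally {
                    // Whatever happens in the loop, give the permit back on exit,
                    // so shutdown() can see that this thread is gone.
                    readerWriterSemaphore.release();
                }
            }, name);
            t.setDaemon(true);
            t.start();
        }
    }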


#22

Unfortunately my logs didn’t go that far. I’ll keep an eye open for this as well.

Yeah, but I still haven’t taken the time to dig into JUL and to figure out what exactly I need to paste where to make this work. If you have a simple example I can steal, please let me know.
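
For reference, a rough sketch of what such a JUL-to-logcat bridge could look like. The class name and install point are made up; only Logger.getLogger("org.jivesoftware.smack") and the android.util.Log calls are standard API.

    import java.util.logging.Handler;
    import java.util.logging.Level;
    import java.util.logging.LogRecord;
    import java.util.logging.Logger;
    import java.util.logging.SimpleFormatter;

    import android.util.Log;

    // Forwards java.util.logging records to logcat so that Smack's FINE output
    // shows up in `adb logcat`. Class name and install point are illustrative.
    public class JulToLogcatHandler extends Handler {

        // Keep a strong reference, otherwise JUL may garbage-collect the logger
        // together with its level and handler configuration.
        private static Logger smackLogger;

        private final SimpleFormatter formatter = new SimpleFormatter();

        // Call once, e.g. from Application.onCreate().
        public static void install() {
            smackLogger = Logger.getLogger("org.jivesoftware.smack");
            smackLogger.setLevel(Level.FINE);
            smackLogger.addHandler(new JulToLogcatHandler());
        }

        @Override
        public void publish(LogRecord record) {
            int priority = record.getLevel().intValue() >= Level.WARNING.intValue()
                    ? Log.WARN
                    : Log.DEBUG;
            String message = formatter.formatMessage(record); // resolves {0}-style parameters
            Log.println(priority, "Smack", record.getLoggerName() + ": " + message);
            if (record.getThrown() != null) {
                Log.println(priority, "Smack", Log.getStackTraceString(record.getThrown()));
            }
        }

        @Override
        public void flush() {
            // nothing buffered
        }

        @Override
        public void close() {
            // nothing to release
        }
    }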


#23

I’ve just run into the same situation, and there is nothing in the log since before the old connection failed. Furthermore, it looks like there are two PacketWriters in existence, both waiting on their respective queue.take() (different queue objects), and neither queue has been shut down (isShutdown is false).

Looking back at my own code, I only call c.instantShutdown() if it reports isConnected()==true, which is obviously not the case here. I’m not sure which precondition would be a better match for running instantShutdown() on, or if I can just hammer it anyway?


#24

That. If you do the instantShutdown() only conditionally, that would also explain why you are experiencing it whereas I cannot remember ever running into it.

I also have a root-cause hypothesis: it all comes down to state destruction on errors/exceptions. This is something I considered at the very beginning when designing the new state-machine connection types, but it was never done properly in the existing (legacy) connection types. Assume, for example, that this sync point throws (or the sync point right below it), which is not unlikely on slow or unreliable mobile connections. A following call to connect(), if there was no instantShutdown() or shutdown() in between, may then stall because the reader and writer threads were not terminated. I will see what can be done about it.
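
The direction sketched here is, roughly, to destroy the connection state whenever an exception escapes the connect path, so that a later connect() does not inherit live reader/writer threads. Illustrative shape only; negotiateStream() and shutDownThreads() are placeholders, not the actual Smack change.

    // Illustrative shape of the "destroy state on error" idea, not the actual Smack patch.
    // negotiateStream() and shutDownThreads() are placeholders for the real internals.
    public abstract class CleanupOnErrorSketch {

        // May block and may throw at a sync point, e.g. on a slow mobile link.
        protected abstract void negotiateStream() throws Exception;

        // Terminate the reader/writer threads and close the socket.
        protected abstract void shutDownThreads();

        public final void connectInternal() throws Exception {
            try {
                negotiateStream();
            } catch (Exception e) {
                // Tear everything down before propagating the error; otherwise the
                // next connect() stalls waiting for the reader/writer semaphore permits.
                shutDownThreads();
                throw e;
            }
        }
    }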


#25

Something like this


#26

Except this will destroy the XEP-0198 (Stream Management) state during a reconnect-gone-wrong. You probably need to expose instantShutdown() one layer up and call that instead.


#27

Right. For now it would be great if you would invoke instantShutdown() unconditionally and report back your findings.
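
In application code that boils down to something like the following sketch (assuming connection is the XMPPTCPConnection instance; error handling trimmed):

    import org.jivesoftware.smack.tcp.XMPPTCPConnection;

    // Sketch of the suggested workaround: always tear the previous attempt down,
    // regardless of what isConnected() reports, before connecting again.
    public final class ReconnectHelper {

        public static void reconnect(XMPPTCPConnection connection) throws Exception {
            connection.instantShutdown(); // unconditional, even if isConnected() == false
            connection.connect();
            connection.login();
        }
    }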


#28

Will do so. Thanks so far.


#29

I’ve deployed the latest smack 4.3-fixes branch and am calling instantShutdown unconditionally before connecting. I still have a stall in the semaphore, traces attached: traces-yaxim-bruno.txt (39.0 KB)

No, I haven’t dug into JUL yet, so no useful logs for you, sorry.


#30

Thanks for reporting 🙂

This appears to be caused by a synchronized block which became superfluous due to the latest changes.

Could you try


#31

I’ve tried your latest 4.3 branch (https://github.com/Flowdalic/Smack/commit/7059b60672628d1427776a794d396fb09b2ed42b) and it ended up blocking in mXMPPConnection.instantShutdown(). I didn’t have a chance to take an ANR snapshot of it, though, sorry.


#32

Oh, I’m sorry. I’ve just fetched the ANR log and realized that I’m still on 4.3.3-SNAPSHOT and not on 4.3.4-SNAPSHOT. Please ignore my last message from today.


#33

I’ve re-evaluated with the correct code now. It still hangs. Apparently, the race condition is triggered by a timeout during login (not connect), and after login() fails with a NoResponseException:

Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for successful SASL authentication
   at org.jivesoftware.smack.SASLAuthentication.authenticate(SASLAuthentication.java:224)
   at org.jivesoftware.smack.tcp.XMPPTCPConnection.loginInternal(XMPPTCPConnection.java:402)
   at org.jivesoftware.smack.AbstractXMPPConnection.login(AbstractXMPPConnection.java:528)
   at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:724)
   ... 2 more

…the PacketReader is still awaiting data from the socket, which it eventually receives after my call to instantShutdown() completes - so it looks like instantShutdown() is not properly tearing down the socket or something else is failing. Because of that, connectInternal() keeps hanging on the readerWriterSemaphore and I’m deadlocked.
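
For background on why the socket teardown matters here: a thread blocked in a plain java.net.Socket read does not react to Thread.interrupt(); the usual way to force it out is to close the socket from another thread, after which the read fails with a SocketException. A small standalone demonstration of that behavior (not Smack code):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Demonstrates that closing a Socket from another thread makes a blocked read()
    // fail with a SocketException, which is how a reader thread stuck waiting for
    // data can be forced to exit.
    public class SocketCloseUnblocksRead {

        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                Socket client = new Socket("localhost", server.getLocalPort());
                Socket accepted = server.accept();

                Thread reader = new Thread(() -> {
                    try {
                        client.getInputStream().read(); // blocks: the peer never sends anything
                    } catch (IOException e) {
                        System.out.println("reader unblocked: " + e);
                    }
                }, "reader");
                reader.start();

                Thread.sleep(500); // let the reader block in read()
                client.close();    // closing the socket unblocks the read
                reader.join(5000);
                accepted.close();
            }
        }
    }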

traces-yaxim-2019-03-21.txt (74.5 KB)

I would provide Smack logs, but I just sank an hour into trying to beat Android JUL into dumping the Smack logs one way or another, and failed.


#34

That is strange. I would expect the reader/writer threads to have terminated by the time instantShutdown() (or shutdown() FWIW) returns, especially with the latest changes. Could you verify that you see the following code block at the very end of shutdown:

    // Wait for reader and writer threads to be terminated.
    readerWriterSemaphore.acquireUninterruptibly(2);
    readerWriterSemaphore.release(2);

in the Smack version you are using?


#35

Yes, it is there. I have no idea how it got around that.


#36

Not sure what I did, but all the cargo-culted junk in my project is working now. Next time it deadlocks while I’m near a computer, I’ll be able to provide FINEST logs.