XMPPTCPConnection: internal race condition during stream resume?

P.S: isSmResumptionPossible() doesn’t check stream:features post-auth, but only its internal streamId, which is a bit racy and not quite correct…

IIRC I deliberately did not check stream:features as it is unnecessary under the assumption that a server does not suddenly lose its stream management capabilities.

Not sure what you mean “is a bit racy and not quite correct”. Happy to fix any issues, but it is hard it you don’t actually point them out.

Also this should possibly have been a new topic.

Right, there are provisions to wait for the stream future annoucement of the server.

Please have a look at my XML from the inital post, this is what I’m bangig my head against for some days now and where I really urgently need a second pair of eyes.

Regarding the stream-features post-auth, it looks like Smack will happily cache those from pre-auth or even pre-tls, and those might be different (e.g. not contain bind or other things). Not waiting for the updated features is what I considered as “racy” in my comment.

Then I’m missing the code where you are waiting for the stream feature announcement after authentication.

I only can imagine that this could happen if two writer threads are running concurrently using the same queue and socket. But I have to think more about this, which may take some time, especially considering that the holiday seasion is starting soon.

I also experienced similar behavior when deflate compression broke because it was using the wrong dictionary. But I think you mentioned off band to me that compression is not involved here.

It is good that Smack then in-fact does wait for the updated features. Right? :slight_smile:

It’s a little bit hidden and arguably not the most elegant code (cause “organically grown”), but it’s done by the maybeCompressFeaturesReceived sync point.

Yes, you are right. I’m not using compression.

Thanks, I missed that one indeed. Now that I’ve looked at the code, there is one minor little catch in it:

        if (!config.isCompressionEnabled()) {
            return;
        }
        maybeCompressFeaturesReceived.checkIfSuccessOrWait();

This checkpoint isn’t awaited if I have compression disabled. I have a feeling that reverting the order of the two above commands will work around my race condition, but I still think thins needs to be properly investigated.

Good catch, that is indeed wrong and the order should be switched. I’ve created SMACK-846.

Does “my race condition” refer to your initial problem of intermixed elements in the stream? If so, it would be great if you could test if switching the order fixes the problem for you.

Yes, I’m referring to the XML garbage in the initial post. I’m testing the reversed-order code now for the next days, and will have to see whether the problem happens again. It was pretty hard to reproduce in the past - you need a special kind of bad network connectivity that I only experience on business trips, and it needs a background logging mechanism on the phone or a permanent adb connection.

I still think that there is something more fundamental going bad (how can you ever end up with two XML elements being written concurrently?), but if this is going to be a viable workaround, I might stop caring.

My first guess would be two writer threads using the same queue and socket.

It is not immediately clear to me how this could be caused by that. But it is the nature of concurrency problems that they are sometimes hard comprehend. So please, in order to be not biased, forget the two previous statements if you ever try to squash the bug.

It happened again. There were multiple connection attempts, with the funniest of excuses for why they couldn’t be connected.

It started with a SSLException: Read error: ssl=0x7b11645680: I/O error during system call, Software caused connection abort caused by bad connectivity.

Then it failed to connect at all:

Caused by: org.jivesoftware.smack.SmackException$ConnectionException: The following addresses failed: 'xmpp.yaxim.org:5222' failed because: /212.21.75.16 exception: java.net.ConnectException: failed to connect to /212.21.75.16 (port 5222) from /:: (port 0) after 30000ms: connect failed: ENETUNREACH (Network is unreachable), xmpp.yaxim.org/2a01:30:2000:100::16 exception: java.net.ConnectException: failed to connect to xmpp.yaxim.org/2a01:30:2000:100::16 (port 5222) from /:: (port 0) after 30000ms: connect failed: ENETUNREACH (Network is unreachable)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectUsingConfiguration(XMPPTCPConnection.java:661)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:936)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:411)
	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:718)
	... 2 more

On the next attempt after network became “available” again, it ran into the 30s TLS handshake timeout:

01-30 08:16:09.548 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:16:11.190 D/SMACK   (21303): RECV (0): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' xml:lang='en' from='yax.im' id='726d44e9-59fa-40a5-a034-ef76842a10cf' version='1.0' xmlns='jabber:client'><stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'><required/></starttls></stream:features>
01-30 08:16:11.192 D/SMACK   (21303): SENT (0): <starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'></starttls>
01-30 08:16:12.172 D/SMACK   (21303): RECV (0): <proceed xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>
01-30 08:16:39.538 E/yaxim.SmackableImp(21303): onDisconnected: org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:16:39.539 W/System.err(21303): org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:730)
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.doConnect(SmackableImp.java:303)
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp$5.run(SmackableImp.java:388)
01-30 08:16:39.541 W/System.err(21303): Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for establishing TLS
01-30 08:16:39.541 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
01-30 08:16:39.541 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:942)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:411)
01-30 08:16:39.542 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:718)
01-30 08:16:39.542 W/System.err(21303): 	... 2 more

After a 5s delay, yaxim reconnected again and sent two streams after TLS, immediately crashing with a String NPE (unfortunately, my logger doesn’t write thread PIDs to disk, but I’m pretty sure we have two threads sending things here):

01-30 08:16:49.691 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:16:52.364 D/SMACK   (21303): RECV (0): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' xml:lang='en' from='yax.im' id='27ac7399-17ec-4ef2-83f7-fa9e6c0c3102' version='1.0' xmlns='jabber:client'><stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'><required/></starttls></stream:features>
01-30 08:16:52.367 D/SMACK   (21303): SENT (0): <starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'></starttls>
01-30 08:16:54.911 D/SMACK   (21303): RECV (0): <proceed xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>
[TLS cert snipped]
01-30 08:17:00.974 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:17:01.511 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:17:19.690 E/yaxim.SmackableImp(21303): onDisconnected: org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:17:19.690 W/System.err(21303): org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:730)
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.doConnect(SmackableImp.java:303)
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp$5.run(SmackableImp.java:388)
01-30 08:17:19.693 W/System.err(21303): Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for establishing TLS
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)

The String NPE is from XMPPTCPConnection trying to process the XML tag name:

final String name = parser.getName();
switch (name) {

After that, it’s not possible any more to connect this instance of XMPPTCPConnection, it will always send XML mix garbage:

01-30 08:33:10.759 D/SMACK   (21303): SENT (0): 
<iq id='4JLnJ-114289' type='get'><query xmlns='jabber:iq:roster'></query></iq>
01-30 08:33:10.759 D/SMACK   (21303): SENT (0): 
<<r xmlns='urn:xmpp:sm:3'/>
01-30 08:33:10.760 D/SMACK   (21303): SENT (0): 
iq to='yax.im' id='4JLnJ-114290' type='get'><query xmlns='http://jabber.org/protocol/disco#items'></query></iq>
01-30 08:33:59.205 D/SMACK   (21303): SENT (0):
<iq to='yax.im' id='4JLnJ-114295' type='get'><query xmlns='http://jabber.org/protocol/disco#items'></query></iq>
<stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
1 Like

Please consider using stable, i.e., with the full git commit id within the URL, github links (e.g. by pressing y on the github page). It appears this particular link is laready pointed to the wrong code snipped.

is the relevant place in the code, it only appears once in the source file :slight_smile:

Do we have a stacktrace for that? It assume the reader thread would appear within that.

I wouldn’t be so sure. Two threads would not explain the duplicate stream open assuming that the used queue is properly synchronized: Then the same stream open would only be returned once. This could also be a dangling stream open in the queue (which would, of course, also be a bug).

Here we probably have two threads.

01-30 08:15:56.141 W/AbstractXMPPConnection(21303): javax.net.ssl.SSLException: Read error: ssl=0x7b11645680: I/O error during system call, Software caused connection abort
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:766)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:288)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:351)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:180)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.InputStreamReader.read(InputStreamReader.java:184)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.BufferedReader.read1(BufferedReader.java:221)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.BufferedReader.read(BufferedReader.java:297)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.util.ObservableReader.read(ObservableReader.java:42)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.fillBuffer(KXmlParser.java:1516)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.peekType(KXmlParser.java:993)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.next(KXmlParser.java:349)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.next(KXmlParser.java:313)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1282)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1034)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1050)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.lang.Thread.run(Thread.java:764)
01-30 08:15:56.142 V/NativeCrypto(6266): SSL shutdown failed: ssl=0x7afeec9ec8: I/O error during system call, Broken pipe

Is it possibly related to completing the TLS handshake the first time, but then aborting with a timeout?

I have created SMACK-855.

A first attempt to fix this can be found at

Please report back your experience with that patch.

I’ve replaced my aggressive killing code and abort-flag with your code. I’ve also added a synchronized (mXMPPConnection) around the connect+login calls, to turn those two into an atomic action.

Since then, I’ve run into issues where my previous connecting thread was properly terminated + joined, but the new one still blocks on the mXMPPConnection semaphore, forever. Unfortunately, this happens when I’m on the run, and I can’t attach the debugger after the fact.

A thread dump when this situation arises would be very helpful.

Ah, that’s great to keep in mind. I just managed to reproduce the issue and to attach the debugger, and the connect() thread isn’t hanging in my synchronized, which I expected, but in readerWriterSemaphore.acquire(2); in initConnection(), so apparently the previously acquired semaphore isn’t released. Attached the trace file (limited to the yaxim PID, tid=20 is the relevant thread that’s hanging).

yaxim-trace.txt (73.5 KB)

Thanks, this means that

LOGGER.log(Level.FINE, "Not every reader/writer threads where terminated on connection re-initializtion of {0}. Available permits {1}", logObjects);

should be logged. You probably want to print FINE logs for XMPPTCPConnection in the adb log.

Indeed. There is a stray Smack Writer thread tid=27 which is probably the reason why the semaphore was not incremented yet.

"Smack Writer (0)" daemon prio=5 tid=27 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13c41b60 self=0x77bb146e00
  | sysTid=30938 nice=0 cgrp=default sched=0/0 handle=0x778c0d34f0
  | state=S schedstat=( 6130306 8063154 9 ) utm=0 stm=0 core=2 HZ=100
  | stack=0x778bfd1000-0x778bfd3000 stackSize=1037KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x029530d5> (a java.lang.Object)
  at java.lang.Thread.parkFor$(Thread.java:2135)
  - locked <0x029530d5> (a java.lang.Object)
  at sun.misc.Unsafe.park(Unsafe.java:358)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:190)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2059)
  at org.jivesoftware.smack.util.ArrayBlockingQueueWithShutdown.take(ArrayBlockingQueueWithShutdown.java:303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.nextStreamElement(XMPPTCPConnection.java:1431)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1449)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.access$3400(XMPPTCPConnection.java:1303)
  at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter$1.run(XMPPTCPConnection.java:1352)
  at java.lang.Thread.run(Thread.java:764)

The reader and writer reads are to increment the semaphore if the exit. As of now I couldn’t come up with a hypothesis why the writer thread did not exit. There is a sentinel which checks if the writer thread exited on shutdown

LOGGER.log(Level.WARNING, "shutdownDone was not marked as successful by the writer thread", e);

does that appear in your logs?