powered by Jive Software

XMPPTCPConnection: internal race condition during stream resume?


#1

It looks like there are no provisions in Smack to actually wait for and process a <stream:stream> from the server. I’m seeing very very weird behavior in my implementation during stream resumption, where outgoing <stream:stream> and <resume> are sent in parallel:

12-19 07:40:33.821 D/SMACK   (13162): SENT (1): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' id='1ffda9a8-b175-4326-9bfc-de590b1bb0d8<'resume  xmlns='urn:xmpp:sm:3'xml:lang ='hen=''29686>
12-19 07:40:33.822 D/SMACK   (13162): SENT (1): ' previd='79042661-3d09-49bd-86ad-f10abc4140db'/>

I don’t understand yet how this is happening (both are triggered from loginInternal which is synchronized, so I don’t think it’s related to my threading). OTOH, both elements are put into the PacketWriter.queue so I wonder how they end up intermixed.

Any ideas?


#2

P.S: isSmResumptionPossible() doesn’t check stream:features post-auth, but only its internal streamId, which is a bit racy and not quite correct…


#3

IIRC I deliberately did not check stream:features as it is unnecessary under the assumption that a server does not suddenly lose its stream management capabilities.

Not sure what you mean “is a bit racy and not quite correct”. Happy to fix any issues, but it is hard it you don’t actually point them out.

Also this should possibly have been a new topic.


#4

Right, there are provisions to wait for the stream future annoucement of the server.


#5

Please have a look at my XML from the inital post, this is what I’m bangig my head against for some days now and where I really urgently need a second pair of eyes.

Regarding the stream-features post-auth, it looks like Smack will happily cache those from pre-auth or even pre-tls, and those might be different (e.g. not contain bind or other things). Not waiting for the updated features is what I considered as “racy” in my comment.

Then I’m missing the code where you are waiting for the stream feature announcement after authentication.


#6

I only can imagine that this could happen if two writer threads are running concurrently using the same queue and socket. But I have to think more about this, which may take some time, especially considering that the holiday seasion is starting soon.

I also experienced similar behavior when deflate compression broke because it was using the wrong dictionary. But I think you mentioned off band to me that compression is not involved here.

It is good that Smack then in-fact does wait for the updated features. Right? :slight_smile:

It’s a little bit hidden and arguably not the most elegant code (cause “organically grown”), but it’s done by the maybeCompressFeaturesReceived sync point.


#7

Yes, you are right. I’m not using compression.

Thanks, I missed that one indeed. Now that I’ve looked at the code, there is one minor little catch in it:

        if (!config.isCompressionEnabled()) {
            return;
        }
        maybeCompressFeaturesReceived.checkIfSuccessOrWait();

This checkpoint isn’t awaited if I have compression disabled. I have a feeling that reverting the order of the two above commands will work around my race condition, but I still think thins needs to be properly investigated.


#8

Good catch, that is indeed wrong and the order should be switched. I’ve created SMACK-846.

Does “my race condition” refer to your initial problem of intermixed elements in the stream? If so, it would be great if you could test if switching the order fixes the problem for you.


#9

Yes, I’m referring to the XML garbage in the initial post. I’m testing the reversed-order code now for the next days, and will have to see whether the problem happens again. It was pretty hard to reproduce in the past - you need a special kind of bad network connectivity that I only experience on business trips, and it needs a background logging mechanism on the phone or a permanent adb connection.

I still think that there is something more fundamental going bad (how can you ever end up with two XML elements being written concurrently?), but if this is going to be a viable workaround, I might stop caring.


#10

My first guess would be two writer threads using the same queue and socket.

It is not immediately clear to me how this could be caused by that. But it is the nature of concurrency problems that they are sometimes hard comprehend. So please, in order to be not biased, forget the two previous statements if you ever try to squash the bug.


#11

It happened again. There were multiple connection attempts, with the funniest of excuses for why they couldn’t be connected.

It started with a SSLException: Read error: ssl=0x7b11645680: I/O error during system call, Software caused connection abort caused by bad connectivity.

Then it failed to connect at all:

Caused by: org.jivesoftware.smack.SmackException$ConnectionException: The following addresses failed: 'xmpp.yaxim.org:5222' failed because: /212.21.75.16 exception: java.net.ConnectException: failed to connect to /212.21.75.16 (port 5222) from /:: (port 0) after 30000ms: connect failed: ENETUNREACH (Network is unreachable), xmpp.yaxim.org/2a01:30:2000:100::16 exception: java.net.ConnectException: failed to connect to xmpp.yaxim.org/2a01:30:2000:100::16 (port 5222) from /:: (port 0) after 30000ms: connect failed: ENETUNREACH (Network is unreachable)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectUsingConfiguration(XMPPTCPConnection.java:661)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:936)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:411)
	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:718)
	... 2 more

On the next attempt after network became “available” again, it ran into the 30s TLS handshake timeout:

01-30 08:16:09.548 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:16:11.190 D/SMACK   (21303): RECV (0): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' xml:lang='en' from='yax.im' id='726d44e9-59fa-40a5-a034-ef76842a10cf' version='1.0' xmlns='jabber:client'><stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'><required/></starttls></stream:features>
01-30 08:16:11.192 D/SMACK   (21303): SENT (0): <starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'></starttls>
01-30 08:16:12.172 D/SMACK   (21303): RECV (0): <proceed xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>
01-30 08:16:39.538 E/yaxim.SmackableImp(21303): onDisconnected: org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:16:39.539 W/System.err(21303): org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:730)
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.doConnect(SmackableImp.java:303)
01-30 08:16:39.540 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp$5.run(SmackableImp.java:388)
01-30 08:16:39.541 W/System.err(21303): Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for establishing TLS
01-30 08:16:39.541 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
01-30 08:16:39.541 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:942)
01-30 08:16:39.542 W/System.err(21303): 	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:411)
01-30 08:16:39.542 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:718)
01-30 08:16:39.542 W/System.err(21303): 	... 2 more

After a 5s delay, yaxim reconnected again and sent two streams after TLS, immediately crashing with a String NPE (unfortunately, my logger doesn’t write thread PIDs to disk, but I’m pretty sure we have two threads sending things here):

01-30 08:16:49.691 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:16:52.364 D/SMACK   (21303): RECV (0): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' xml:lang='en' from='yax.im' id='27ac7399-17ec-4ef2-83f7-fa9e6c0c3102' version='1.0' xmlns='jabber:client'><stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'><required/></starttls></stream:features>
01-30 08:16:52.367 D/SMACK   (21303): SENT (0): <starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'></starttls>
01-30 08:16:54.911 D/SMACK   (21303): RECV (0): <proceed xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>
[TLS cert snipped]
01-30 08:17:00.974 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:17:01.511 D/SMACK   (21303): SENT (0): <stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>
01-30 08:17:19.690 E/yaxim.SmackableImp(21303): onDisconnected: org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:17:19.690 W/System.err(21303): org.yaxim.androidclient.exceptions.YaximXMPPException: connectAndLogin failed
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.connectAndLogin(SmackableImp.java:730)
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp.doConnect(SmackableImp.java:303)
01-30 08:17:19.691 W/System.err(21303): 	at org.yaxim.androidclient.service.SmackableImp$5.run(SmackableImp.java:388)
01-30 08:17:19.693 W/System.err(21303): Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 30000ms (~30s). While waiting for establishing TLS
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
01-30 08:17:19.693 W/System.err(21303): 	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)

The String NPE is from XMPPTCPConnection trying to process the XML tag name:

final String name = parser.getName();
switch (name) {

After that, it’s not possible any more to connect this instance of XMPPTCPConnection, it will always send XML mix garbage:

01-30 08:33:10.759 D/SMACK   (21303): SENT (0): 
<iq id='4JLnJ-114289' type='get'><query xmlns='jabber:iq:roster'></query></iq>
01-30 08:33:10.759 D/SMACK   (21303): SENT (0): 
<<r xmlns='urn:xmpp:sm:3'/>
01-30 08:33:10.760 D/SMACK   (21303): SENT (0): 
iq to='yax.im' id='4JLnJ-114290' type='get'><query xmlns='http://jabber.org/protocol/disco#items'></query></iq>
01-30 08:33:59.205 D/SMACK   (21303): SENT (0):
<iq to='yax.im' id='4JLnJ-114295' type='get'><query xmlns='http://jabber.org/protocol/disco#items'></query></iq>
<stream:stream xmlns='jabber:client' to='yax.im' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xml:lang='en'>

#12

Please consider using stable, i.e., with the full git commit id within the URL, github links (e.g. by pressing y on the github page). It appears this particular link is laready pointed to the wrong code snipped.


#13

is the relevant place in the code, it only appears once in the source file :slight_smile:


#14

Do we have a stacktrace for that? It assume the reader thread would appear within that.


#15

I wouldn’t be so sure. Two threads would not explain the duplicate stream open assuming that the used queue is properly synchronized: Then the same stream open would only be returned once. This could also be a dangling stream open in the queue (which would, of course, also be a bug).

Here we probably have two threads.


#16
01-30 08:15:56.141 W/AbstractXMPPConnection(21303): javax.net.ssl.SSLException: Read error: ssl=0x7b11645680: I/O error during system call, Software caused connection abort
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:766)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:288)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:351)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:180)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.InputStreamReader.read(InputStreamReader.java:184)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.BufferedReader.read1(BufferedReader.java:221)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.io.BufferedReader.read(BufferedReader.java:297)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.util.ObservableReader.read(ObservableReader.java:42)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.fillBuffer(KXmlParser.java:1516)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.peekType(KXmlParser.java:993)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.next(KXmlParser.java:349)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.kxml2.io.KXmlParser.next(KXmlParser.java:313)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1282)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1034)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1050)
01-30 08:15:56.141 W/AbstractXMPPConnection(21303):     at java.lang.Thread.run(Thread.java:764)
01-30 08:15:56.142 V/NativeCrypto(6266): SSL shutdown failed: ssl=0x7afeec9ec8: I/O error during system call, Broken pipe

Is it possibly related to completing the TLS handshake the first time, but then aborting with a timeout?


#17

I have created SMACK-855.

A first attempt to fix this can be found at

Please report back your experience with that patch.


#18

I’ve replaced my aggressive killing code and abort-flag with your code. I’ve also added a synchronized (mXMPPConnection) around the connect+login calls, to turn those two into an atomic action.

Since then, I’ve run into issues where my previous connecting thread was properly terminated + joined, but the new one still blocks on the mXMPPConnection semaphore, forever. Unfortunately, this happens when I’m on the run, and I can’t attach the debugger after the fact.


#19

A thread dump when this situation arises would be very helpful.


#20

Ah, that’s great to keep in mind. I just managed to reproduce the issue and to attach the debugger, and the connect() thread isn’t hanging in my synchronized, which I expected, but in readerWriterSemaphore.acquire(2); in initConnection(), so apparently the previously acquired semaphore isn’t released. Attached the trace file (limited to the yaxim PID, tid=20 is the relevant thread that’s hanging).

yaxim-trace.txt (73.5 KB)