XMPPConnection constructor returns object before TLS negotiation complete

I see someone else found the TLS plaintext error. I have seen it and other TLS packet corruption problems probably due to the same write concurrent problem developers have found. So far can not upgrade to 2.6.2 due to other problems. But I thought I share this anomaly I discovered in hope it too may find a fix someday.

In my case I developed a test application which logs in 150 users into 50 shared groups 3 to a group. This can be done as fast as the server will allow or with a time delay between login’s.

The application uses smack and simply does a new XMPPConnection followed by a login. Similar to the code snippet below:

XMPPConnection con = new XMPPConenction (“myserver”);

if (con. isConnected ())

{

System.out.println ("Login as user/password, Secure connect: " +

con.isSecureConnection() + " using TLS: " +

con.isUsingTLS () + “’’”);

con.login (“user”, “password”);

}

Most of the time all is good. isSecureConnection and isUsingTLS both are true. the subsequent login is successful. Occasionally (read rarely), under heavy server load either self inflicted (heavy usage) or other applications (more then likely causing the concurrent TLS server scenario to occur). isSecureConnection and isUsingTLS return false. The subsequent login throws an exception with a huge stack trace way down into the bowels of Sun socket code. Usually a read or write error. The exception appears to be in another thread and can’t be caught in my code. Which makes in impossible to initiate a recovery when it occurs.

Being an old school developer, I have a saying “if it can happen it will happen” therefore the error should be handled.

I have inserted a loop of up to thirty 100 ms waits, waiting for isUsingTLS == true, which seems to help most of the time. But that is pretty kludgey way to fix and if secure login is disabled always instigates a 3 second login delay.

I believe that the constructor should not return until a successful login can be preformed (e.g. TLS negotiation complete) or throw XMPPException is the connection cannot be established.

Is there a better way to handle this potential login failure?

I agree with you, either an exception, either perhaps some internal use of a Future that will block the return until there’'s a positive connection. I have no real experience with concurrent programming so these are only ideas. Anyhow, I belive these might require libs that use Smack to get recompiled, right?

Since I believe it will take a code change to correct the problem, most certainly the libraries will need to be recompiled.

skip

Guess I have to fix it myself…

The fix: (smack 2.2)

The problem is in the startup method of org.jivesoftware.smack.PacketReader

snippet

public void startup() throws XMPPException {

readerThread.start(); << thread which sets connectionID

listenerThread.start(); << thread

// Wait for stream tag before returing. We’'ll wait a couple of seconds before

// giving up and throwing an error.

try {

synchronized(connectionIDLock) {

if (connectionID == null) {

In most cases either connectionID is null at the if and the wait loop is entered or

the connection and TLS is complete and the loop does not need to be entered.

but…

when connectionID is set down in parsePackets that doesn’'t mean TLS is finished!

so the failure is caused by the reader thread getting enough cpu time to set connectionID before the if test in startup is performed. So when it is performed it falls out as if the connection and TLS is ready (but it isn’'t). A subsequent login right after the supposed successful connect causes more packet interaction which interferes with the still ongoing TLS negotiation and everything falls apart (i.e. house of cards?)

In my test of 150 users with a wildfire server 2.6.2 w/a virus scan running and several dead loops kicked off, at least one connection would fail.

My fix is to not kick off the threads until after the test is made thusly:

public void startup() throws XMPPException {

// Wait for stream tag before returing. We’'ll wait a couple of seconds before

// giving up and throwing an error.

try {

synchronized(connectionIDLock) {

if (connectionID == null) {

readerThread.start();

listenerThread.start();

I tested about four runs without failures.

I also tested with TLS disabled, still functions.

skip

Clever! I wonder will it pass Quality Assurance and be merged in the next release?

Darn, i guess I’'ll have to study how a Future works another time, thanks!

Hi,

it will probably not pass QA as Alex did create SMACK-140 to remove the background thread so it gets synchronized without synchronized {}.

LG

In fact, we modified smack and we still were having issues. Our success rate rose but still less then 50% of successes.

Yep, I discovered yesterday while running some other tests that the problem will still occur.

In my test 100 users logged on all in the same shared group which creates a lot of presence traffic. The 60th user barfed. Both clients and server were running on the same machine in this test. One way to stop the error is to disable TLS. That is probably not an acceptable solution.

My solution made it better but it is not fixed.

In todays tests I saw many SSL failures.

The method I used to heighlight problems is to create a small looping batch file that prints “in a loop”. Then run it at above normal priority. Both my test server and test clients are running on the same machine and get very little cpu cycles.

They crawl, fail, then die.

I modifed the PacketReader introducing another flag “finished” did some other mods, it was much better but occasionally an error would occur.

I became suspicious of the KeepAlive task.

Whine on:

  1. A real keep alive keeps track ove socket activity and does nothing unless no activity in keep alive period.

2.If you are going to impelement keep alive as in Smack, you must mutex the writing object! Sort of kinda done but there are instance that are not protected

</stream:stream>. It would be ok to inject a space somewhere into this tag wouldn’'t it? NOT. Some other instances in XMPPConnection.

How bout do all the writing in one package! Now there’'s an idea!

Whine off:

Since modifying PacketReader AND turning off keep-alive I haven’'t seen any failures.

conenctionID in PacketReader should be set to null on an error, its not.

so iff SSL did fail at least a catchable exception would be thrown and recovery possible as it is now PacketReader thread just dies…

Great detective work, perhaps it would be usefull to the developers if you pasted it as a comment to SMACK-140? I tried to raise their awareness during the chat this week, I hope next week you’‘ll be there too, or even better yet that this bug will get nailed as soon as possible as it’‘s a real show stopper (especially on the server side where you don’‘t have a normal chat client but you’'ve got to have reliable components).

Message was edited by: schrepfler

Problem still there

After much investigation and testing on this one, I find even with Smack 2.2.1 this problem still exists. The KeepAlive problem caused similar symptoms.

The cause of this problem roots itself in the PacketReader package’'s startup method (mostly). connectionID is used as a flag in startup to break out (or never enter) a wait loop. Problem is the readerThread is started at beginning of this method. The readerThread sets the connectionID but it actually does so before TLS negotiation as been completed. This can cause the wait loop in the startup method to never be entered or terminate before TLS is actually complete. This causes the XMPPConnection constructor to return (without throwing an exception) and examining isConnected yields true but isSecureConnection and isUsingTLS will be false. Usually, directly after a new XMPPConnection, an application will perform a login. Since TLS negotiation is still happening, the XML emitted by the login confuses the TLS and things snowball down hill to a failure, usually an exception thrown in PacketReader (killing the readerThread).

It is simple enough to demonstrate this (on windows) by creating a small loop.bat which loops echoing I am in a loop. Run it. Adjust its priority to above normal. This slows everything down to sublight speed. Then have a test program perform a connection followed by a login (using TLS of course).

I have seen this occur without the compute intensive process running when just performing 100 consecutive logins. Other processes are always running on an OS some at higher priorities, so its just a matter of timing to get the problem to occur. My Smack app needs to run unattended (without error) 24/7. When the problem occurs it cannot be recovered from since the login method can’'t catch an exception thrown in another thread. Basically, the app goes deaf at that point.

Ok now the fix, I have worked on this a while and tested it extensively. It works. I can’'t say it will be the final solution for the Smack developers, though. I have tried to make it as simple as I can. It is specific to Smack 2.2.1 PacketReader although my original testing was on Smack 2.2.0.

in UNIX diff output.

diff PacketReader221old.java PacketReader221new.java

50a51

private boolean finished = false;

142d142

< if (connectionID == null) {

149c149

< while (connectionID == null && !done) {


while (!done && !finished) {

160d159

< }

189a189

finished = true;

205c205,206

< connection.close();


finished = true;

connectionID = null;

207a209

connection.close();

218c220

<


404a407

this.finished = true;

Basically, the wait loop in startup is always entered. A finished flag replaces connectionID as the flag that startup can proceed. Also, connectionID is set null on connection error.

This post was long enough. I could post the entire file although I felt that might be inappropriate.

Skip

Hey Skip,

Thanks again for your detailed analysis and feedback. You know have permission to upload attachments. Could you upload your modified file?

Thanks,

– Gato

Here tis
packetreader.java (34358 Bytes)

Hi all -

there seem to be many problems wrt. multithreading and custom synchronization issues. I hope you don’'t take this the wrong way but may I make the suggestion to adopt the util.concurrent way of doing things wrt. locks, futures and executors?

Since Smack must still support JDK 1.4 this can be had for free by using the JSR166-quasi-approved backport library from http://dcl.mathcs.emory.edu/util/backport-util-concurrent/. It is in the public domain, extremely well tested and actively maintained by someone who has very deep understanding about Java and its crappy synchronization mechanisms.

Ever since porting Mule from the older dl.util.concurrent and deleting custom thread handling I was able to drastically remove deadlocks, race conditions and cross-thread visibility issues and improving performance by reduced lock contention. As an added bonus you’'ll be able to port to JDK 1.5 by simply fixing the imports.

I realize that adding third-party dependencies is always worth consideration but if you have any questions re. quality, distribution etc. just drop Dawid a mail.

It is impossible to provide working (read: deterministically correct, stable and scalable) XMPP integration in Mule or other projects when Smack itself is not 100% bullet-proof, and using util.concurrent and its data structures simply cannot be beat for that.

Please don’'t hesitate to email me if you have further questions.

– Holger

I completely agree with the previous commenter, java.util.concurrent and it’‘s backport lib for 1.4 are absolutely a must for reliable and scalable concurrent code. When I see people like Josh Bloch, Joe Bowbeer, Brian Goetz, David Holmes, Doug Lea (spec lead), Tim Peierls behind the spec and other distinguished names it’‘s a guarantee of quality and I don’‘t see why should we rely on anything else. We started integrating Mule the other week and these issues that we encountered were a real show stopper. It’‘s a reason why it’‘ll probably won’'t be adopted so quickly in our project as we hoped. Please, these things are critical for enterprise adoption.