Intermittent SSO problems

Carl6 · February 27, 2013, 10:18pm

I searched all over these forums without success before deciding to register and post the problem I’m having.

----Environment----

Server: Server 2008 R2 with OpenFire 3.7.1 installed, LDAP integration (working fine)

Clients: Windows 7 Professional 64bit, Spark 2.6.3.12555

I followed the instructions here: http://community.igniterealtime.org/docs/DOC-1060 to set up SSO, then verified it was working (deleted spark.properties, copied krb5.ini and the registry settings, ran Spark, advanced>>SSO>>enable, entered the server, and logged on without entering credentials).

I then went and customized spark, deleting the exit and logout menu items, as well as setting it up to automatically generate the correct information in the spark.properties file when run for the first time.

Tested it half a million times on test machines with different user profiles, etc.

Deployed to about 120 computers (repackaged with AppDeploy, pushed with PDQDeploy)

Now a bunch of users are reporting they can’t log in because they get the “Unable to connect using Single Sign-On.” error, but most of the users aren’t having any issues at all (and I’ve verified they are in fact connecting with SSO).

I was finally able to reproduce the problem on a test computer (not being able to connect right after installing Spark). However, nothing on the user/workstation side seems to fix it. I can uninstall my custom version of Spark, purge the registry and filesystem of any traces, restart, reinstall the unmodified version, copy krb5.ini and the registry edits, turn on SSO, and it still fails. Sometimes running “klist purge” and rebooting will resolve it, but most of the time it won’t. Sometimes logging the user onto another workstation will suddenly make it work, and sometimes it won’t.

When it fails, there is nothing in the C:\Program Files (x86)\Spark\Logs\error.log file. However, when I turn on Debug on the server side, I am able to capture this:

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] Data Read: org.apache.mina.filter.support.SSLHandler@1dfe254 (HeapBuffer[pos=0 lim=22 cap=64: 17 03 01 00 11 49 D5 9E 0C F7 FC C7 2F 45 88 BC 61 ED 4E 5D 50 A6])

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] unwrap()

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] inNetBuffer: java.nio.DirectByteBuffer[pos=0 lim=22 cap=16665]

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] appBuffer: java.nio.DirectByteBuffer[pos=0 lim=33330 cap=33330]

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] Unwrap res:Status = OK HandshakeStatus = NOT_HANDSHAKING

bytesConsumed = 22 bytesProduced = 1

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] inNetBuffer: java.nio.DirectByteBuffer[pos=22 lim=22 cap=16665]

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] appBuffer: java.nio.DirectByteBuffer[pos=1 lim=33330 cap=33330]

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] Unwrap res:Status = BUFFER_UNDERFLOW HandshakeStatus = NOT_HANDSHAKING

bytesConsumed = 0 bytesProduced = 0

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] appBuffer: java.nio.DirectByteBuffer[pos=0 lim=1 cap=33330]

2013.02.27 09:33:16 org.jivesoftware.openfire.nio.ClientConnectionHandler - [/192.168.1.238:49631] app data read: HeapBuffer[pos=0 lim=1 cap=1: 20] (20)

2013.02.27 09:33:16 org.apache.mina.filter.executor.ExecutorFilter - Launching thread for /192.168.1.238:49631

2013.02.27 09:33:16 org.apache.mina.filter.executor.ExecutorFilter - Exiting since queue is empty for /192.168.1.238:49631
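A note on reading the first record in that log: the 22 bytes in the "Data Read" line begin with a standard 5-byte TLS record header, which can be decoded by hand. The sketch below is only an illustration (the function name `parse_tls_record_header` is made up here and is not part of OpenFire or MINA); the content-type and version codes are the standard TLS values. It suggests the record is ordinary TLS 1.0 application data, and the single decrypted byte `20` is an ASCII space, so this excerpt may not capture the SSO failure itself. That reading is an assumption, not something confirmed in this thread.

```python
# Standard TLS record-layer content types and protocol versions.
CONTENT_TYPES = {
    0x14: "change_cipher_spec",
    0x15: "alert",
    0x16: "handshake",
    0x17: "application_data",
}
VERSIONS = {
    (3, 0): "SSL 3.0",
    (3, 1): "TLS 1.0",
    (3, 2): "TLS 1.1",
    (3, 3): "TLS 1.2",
}


def parse_tls_record_header(hex_dump):
    """Decode the 5-byte TLS record header at the start of a hex dump.

    Hypothetical helper for illustration only.
    """
    data = bytes.fromhex(hex_dump)  # whitespace between byte pairs is ignored
    if len(data) < 5:
        raise ValueError("a TLS record header is 5 bytes long")
    payload_length = int.from_bytes(data[3:5], "big")
    return {
        "content_type": CONTENT_TYPES.get(data[0], "unknown"),
        "version": VERSIONS.get((data[1], data[2]), "unknown"),
        "payload_length": payload_length,
        # True when the dump holds exactly one whole record.
        "complete": len(data) == 5 + payload_length,
    }


# The bytes copied from the "Data Read" log line above.
record = parse_tls_record_header(
    "17 03 01 00 11 49 D5 9E 0C F7 FC C7 2F 45 88 BC 61 ED 4E 5D 50 A6"
)
print(record)
```

The header accounts for 5 bytes and the declared payload for 17, which matches the `lim=22` in the log line.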

From that log, it seems like something is going wrong server side, but I can’t figure it out, because I can’t seem to find any scenario where it always (or never) works!

Anyone have any clues, hints or ideas?

Anything at all would be greatly appreciated, if I can get it working, I’ll post documentation on how I modified everything (since I’ve seen quite a few users asking how to do what I did) But if it doesn’t work, well there’s not really much point in posting a mod that breaks SSO…

Carl

speedy · February 28, 2013, 2:52am

Here are some things to check if you haven’t already

Clear your DNS cache and try again.

Do you have a PTR record for your XMPP server?

Are you using DNS for Kerberos info? If so, what DNS servers are the workstations using? Has that info replicated to them? Are the DNS entries correct?

Are you using krb5.ini file? has it been copied to the workstation and in the correct location?

Is UAC enabled? If so, disable it.

This shouldn’t be the case since you say sso is working on some workstations, but…What is your forest/domain function level? If 2008 r2 or higher, you’ll need to make sure your keytab user account is set to use DES encryption type, and you’ll need to allow DES encryption via a GPO. DES is disabled by default on 2008r2 and Windows 7.

Have you used wireshark on the workstations that are giving you trouble?

Hope some of this helps.
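The PTR check in the list above can be scripted. This is a minimal sketch, assuming Python is available on the workstation; `check_ptr` is a hypothetical helper name, and the two resolver parameters exist only so the function can be exercised without touching real DNS. It resolves the server name to its addresses, does a reverse lookup on each one, and reports whether the PTR name matches the name you started with.

```python
import socket


def check_ptr(hostname,
              forward=socket.gethostbyname_ex,
              reverse=socket.gethostbyaddr):
    """Return a list of (ip, ptr_name_or_None, matches) for hostname.

    Hypothetical helper: forward/reverse default to the real resolvers
    and can be swapped for fakes when testing.
    """
    _, _, addresses = forward(hostname)
    wanted = hostname.lower().rstrip(".")
    results = []
    for ip in addresses:
        try:
            ptr_name, _, _ = reverse(ip)
        except (socket.herror, socket.gaierror):
            ptr_name = None  # no PTR record, or the lookup failed
        matches = (ptr_name is not None
                   and ptr_name.lower().rstrip(".") == wanted)
        results.append((ip, ptr_name, matches))
    return results
```

Calling `check_ptr("servername.domain.local")` (a placeholder name) on an affected workstation would list every address the name resolves to, along with its reverse name; any row where the last field is `False` is worth a closer look.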

Carl6 · March 4, 2013, 3:34pm

Thanks for the quick reply

I’ve gone through and tried/checked a few things you suggested:

Flushing the DNS cache fixed the problem the first time (flushed DNS, tried Spark, it worked!). However, after logging off Windows and then back on, it fails again, and a DNS flush doesn’t fix it a second time.

Verified the krb5.ini file was in the correct location (C:\Windows) and formatted correctly.

Forest/domain function level is Server 2003

Went back to the instructions and checked my spn for xmpp-openfire. I found 2… not sure if that’s right or not, they were:

xmpp/servername.domain.local

xmpp/servername.domain.local@domain.local

(note: for privacy purposes the above are NOT my actual values)

Per your suggestion I fired up Wireshark and captured the network traffic while attempting the logon. I was able to capture a successful logon as well as 2 failures. The success was after flushing the dns cache, purging the tgt, restarting and forcing a gpupdate.

Found a few things of interest in the capture!

First off, on all of the captures there is a KRB5 packet that says:

“KRB Error: KRB5KRB_ERR_RESPONSE_TOO_BIG”

I did a little googling and it seems this is not a problem; it’s just the Kerberos UDP failure that forces it to use TCP (or at least, so I concluded, and since the successful one also had it, it seems like a fair explanation).

**However,** on the successful logon the packet is from the correct IP address (the KDC IP address), but on both failures the packet is from a secondary IP address leading to the same server… I’m thinking this is probably the issue.
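One way to confirm that the KDC name resolves to more than one address is to list everything DNS hands back for it and flag whatever is not the address you expect. This is a rough sketch, not a definitive diagnostic: `unexpected_addresses` is a hypothetical name, the `resolver` parameter is only there so the function can be tested offline, and port 88 is used simply because it is the standard Kerberos port.

```python
import socket


def unexpected_addresses(hostname, expected_ip, resolver=socket.getaddrinfo):
    """Return the IPv4 addresses for hostname that differ from expected_ip.

    Hypothetical helper: resolver defaults to socket.getaddrinfo and can
    be replaced with a fake for testing.
    """
    infos = resolver(hostname, 88, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    addresses = sorted({info[4][0] for info in infos})
    return [ip for ip in addresses if ip != expected_ip]
```

If `unexpected_addresses("servername.domain.local", "192.168.1.10")` (both values are placeholders) returns a non-empty list on a failing workstation, the client may be picking one of those extra addresses some of the time, which would fit the intermittent behaviour described here.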

speedy · March 4, 2013, 3:56pm

interesting…

The easiest way to test would be to edit the hosts file on a workstation. First try the IP address that seems to be working, then try the secondary IP.

Carl6 · March 7, 2013, 3:23pm

Thanks for the tip.

Looks like that’s the problem. We have RRAS installed on our DC… so yeah, gonna be fixing that.