S2S woes

I’m attempting to get S2S working reliably. I’ve got 2.3.0 (alpha) running on Linux at both ends. One server has a conventional public IP; the other is behind a NAT and uses port forwarding. There are full DNS entries for both boxes. The public IP on the NAT network is on the router, and the host itself is configured with a 192.168.1.2 private IP.

When both servers are started, the system initially works. I can see MUC rooms on each server and communicate OK. The problem seems to crop up when the s2s connection goes down. The firewall between the two boxes is pretty notorious for terminating TCP connections violently and without warning. When the 5269 s2s connection goes down, it seems not to come back up.

First of all, the dialback operation seems to be failing (DNS names changed to protect the innocent):

2005.10.06 06:32:22 org.jivesoftware.messenger.server.ServerDialback.createOutgoingSession(ServerDialback.java:194) Error creating outgoing session to remote server: cs.spn.edu(DNS lookup: cs.spn.edu)

java.net.ConnectException: Connection timed out

This is failing because the server is dialing back to the wrong host. It should be trying to dial back to “xmpp.cs.spn.edu”, but the first part of the FQDN is being removed. Any idea why this may be happening? It’s strange, since as I say it initially works. It’s only after some period of time, and I think after a violent death of the 5269 s2s connection, that this crops up.

I also see this on the box behind the NAT:

2005.10.06 06:52:34 org.jivesoftware.messenger.server.ServerDialback.createOutgoingSession(ServerDialback.java:194) Error creating outgoing session to remote server: spn.org(DNS lookup: spn.org)

This makes me worried for two reasons. First, just as before, it looks like Jive lops off the first part of the FQDN. Second, it looks like the server is trying to establish a connection to itself based on the DNS name. Does this do an actual, guaranteed DNS lookup, or will it go through the nsswitch process? Since this is on a NAT, I’ve got an /etc/hosts entry that points to 192.168.1.2 for that name, while DNS points to the public IP.
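One way to see whether the two lookup paths disagree on the NAT box is to compare them from Java directly. The sketch below is only a diagnostic under stated assumptions: `ResolverCheck` and its method names are hypothetical, and nothing here confirms which path Jive Messenger itself uses. `InetAddress.getByName` goes through the JVM's default resolver, which on Linux typically follows nsswitch and so honors /etc/hosts, while the JDK's JNDI DNS provider queries the configured DNS servers directly.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical diagnostic class; not part of Jive Messenger.
class ResolverCheck {

    // Resolves through the JVM's default resolver (typically /etc/hosts first on Linux).
    static String systemLookup(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getHostAddress();
    }

    // Asks the configured DNS servers for A records, bypassing /etc/hosts.
    static List<String> dnsOnlyLookup(String host) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(host, new String[] {"A"});
            List<String> addresses = new ArrayList<>();
            Attribute a = attrs.get("A");
            if (a != null) {
                for (int i = 0; i < a.size(); i++) {
                    addresses.add(String.valueOf(a.get(i)));
                }
            }
            return addresses;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        try {
            System.out.println("system resolver: " + systemLookup(host));
        } catch (UnknownHostException e) {
            System.out.println("system resolver failed: " + e.getMessage());
        }
        try {
            System.out.println("DNS only:        " + dnsOnlyLookup(host));
        } catch (NamingException e) {
            System.out.println("DNS-only lookup failed: " + e.getMessage());
        }
    }
}
```

Running it with the server's own name as the argument on the box behind the NAT should show the 192.168.1.2 address from the first lookup and the public IP from the second, if the /etc/hosts override is in effect.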

mcgredo,

It sounds like you’re running into some sort of bug. However, the spn.org and cs.spn.edu lookups are easily explained. Let’s say you have the server: xmpp.example.com. Your conference service will be at conference.xmpp.example.com. However, setting up DNS for that sub-domain can be a big PITA for many users. Therefore, Jive Messenger will walk up the DNS tree to see if there’s a server at the parent domain that can handle the connection request. So, let’s take the example above:
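A minimal sketch of that walk, assuming the server simply strips the leftmost label on each step and stops before the bare top-level domain. The class and method names are hypothetical, and this is an illustration of the idea rather than Jive Messenger's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of walking up the DNS tree; not Jive Messenger source.
class ParentDomainWalk {

    // Returns the hostname followed by each parent domain, most specific first.
    // The bare top-level domain (a name with no dot) is never included.
    static List<String> candidates(String hostname) {
        List<String> result = new ArrayList<>();
        String current = hostname;
        while (current.indexOf('.') != -1) {
            result.add(current);
            // Drop the leftmost label, e.g. "xmpp.example.com" -> "example.com".
            current = current.substring(current.indexOf('.') + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String name : candidates("conference.xmpp.example.com")) {
            System.out.println(name);
        }
        // Prints:
        // conference.xmpp.example.com
        // xmpp.example.com
        // example.com
    }
}
```

Under the same assumption, a failed attempt on xmpp.cs.spn.edu would be followed by an attempt on cs.spn.edu, which is the name that shows up in the first log line.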

That make sense?

So, what’s basically happening is that connections to your box are failing for some reason. It could be due to DNS issues, general network issues, or some strange error in Jive Messenger.

Any chance you can turn this into a reproducible test case? I.e., yank a network cable at a certain time and then see the issue?
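When reproducing this, it may help to probe the s2s port from each side at the moment the failure appears, so a plain network problem can be told apart from a server problem. The following is a rough sketch: the class name `S2SPortProbe` and the 5-second timeout are assumptions, and 5269 is the s2s port mentioned earlier in the thread.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for checking whether a TCP port accepts connections.
class S2SPortProbe {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable names.
            System.err.println("connect to " + host + ":" + port + " failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: java S2SPortProbe <host> [port]");
            return;
        }
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 5269;
        boolean ok = canConnect(args[0], port, 5000);
        System.out.println(args[0] + ":" + port + (ok ? " reachable" : " NOT reachable"));
    }
}
```

If the probe to the full server name succeeds while the server log still shows timeouts against the parent domain, that points away from the network and toward the server's retry logic.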

Regards,

Matt

Hmm. What’s the algorithm for the fallback to the reduced FQDN? Will it go from xmpp.cs.spn.edu to cs.spn.edu if the connection fails to be established, or only if the DNS name fails to resolve? And is there an attempt to try the original name after some period of time?

What could be happening is that the underlying network simply fails. The connection to xmpp.cs.spn.edu will fail, and then all the connections for names up the FQDN will fail as well. Does it reset and try xmpp.cs.spn.edu eventually?

BTW, I’ve got debug logging turned on now, so I hope to get some better diagnostics.