Bug in Server to Server? Looping error

So one of our test boxes went crazy today. Two boxes are linked, and we’re running shared rosters using groups and the auto-subscribe plugin. Both servers are whitelisted.

We took down a box to test a build, and the other went nuts. CPU and memory usage went through the roof, and all the error logs filled up with the following looping error (trimmed for readability):

2007.10.04 10:06:48 [org.jivesoftware.openfire.session.OutgoingServerSession.createOutgoingSession (OutgoingServerSession.java:258)] Error trying to connect to remote server: YYY.XXX.com(DNS lookup: YYY.XXX.com:5269)
java.net.ConnectException: Connection refused
2007.10.04 10:06:48 [org.jivesoftware.openfire.session.OutgoingServerSession.createOutgoingSession (OutgoingServerSession.java:258)] Error trying to connect to remote server: XXX.com(DNS lookup: XXX.com:5269)
java.net.ConnectException: Connection refused

This loop occurred 2252 times over a 44-second period. I’m sure there were more, but those were the contents of the six rotated Openfire error logs when we killed it.

I think this is an issue with Openfire not handling subdomains of the form ZZZ.YYY.XXX.com correctly (something we have previously encountered elsewhere).

Either way, shouldn’t this DNS lookup actually give up after a couple of attempts and maybe queue a retry at a reasonable time in the future?
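To make concrete what I mean by giving up and queuing a retry, here is a minimal sketch of a capped exponential backoff. This is not Openfire code: the class name RetryPolicy, its parameters, and the limits used are all made up for illustration.

```java
import java.util.OptionalLong;

/**
 * Hypothetical helper (not part of Openfire): decides how long to wait
 * before the next connection attempt, or whether to give up entirely.
 */
final class RetryPolicy {
    private final int maxAttempts;       // assumed limit on total attempts
    private final long baseDelayMillis;  // delay after the first failure
    private final long maxDelayMillis;   // upper bound on any single delay

    RetryPolicy(int maxAttempts, long baseDelayMillis, long maxDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /**
     * @param failedAttempts how many attempts have already failed (1 or more)
     * @return the delay before the next attempt, or empty to give up
     */
    OptionalLong nextDelayMillis(int failedAttempts) {
        if (failedAttempts >= maxAttempts) {
            return OptionalLong.empty(); // stop retrying
        }
        // Double the delay after each failure; the shift is clamped so the
        // multiplication stays well within a long for reasonable base delays.
        int shift = Math.max(0, Math.min(failedAttempts - 1, 20));
        long delay = baseDelayMillis * (1L << shift);
        return OptionalLong.of(Math.min(delay, maxDelayMillis));
    }
}
```

With a policy of 5 attempts, a 1 second base delay and a 60 second cap, the waits would be 1, 2, 4 and 8 seconds, after which the caller stops trying instead of hammering the remote server thousands of times a minute.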

Is there a fundamental bug in how Openfire deals with domain names containing three or more periods? If so, is it being addressed? If not, what is causing this unexpected behavior?

Any input on this would be appreciated.

So no one has any input on this? It’s quite a serious error, and I would have thought someone on the dev team would have something to say about it. Is it a known error, or has no one heard of anything like this before?

Hey Clive,

Could you obtain a thread dump of the server when it starts to consume a lot of CPU? Under Linux/Solaris you can execute kill -3 [process id]. My guess is that something (e.g. the auto-subscribe plugin) is continually forcing an s2s connection to be established. There is an improvement filed to not attempt to create a new s2s connection if one recently failed. However, if the plugin keeps requesting an s2s connection, you will still see high CPU usage. Therefore, we need to figure out the root of the problem.
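As a rough illustration of the "don't retry if one recently failed" idea, the sketch below remembers when a host last failed and refuses new attempts during a cooldown window. It is only an assumption of how such a check could look, not the filed improvement itself; FailedHostCache and its methods are hypothetical names, and the caller passes in the current time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not Openfire code): tracks hosts whose outgoing
 * connection recently failed so callers can skip immediate retries.
 */
final class FailedHostCache {
    private final Map<String, Long> failedAt = new ConcurrentHashMap<>();
    private final long cooldownMillis; // assumed cooldown after a failure

    FailedHostCache(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** Remember that connecting to this host failed at the given time. */
    void recordFailure(String host, long nowMillis) {
        failedAt.put(host, nowMillis);
    }

    /** Forget any earlier failure once a connection succeeds. */
    void recordSuccess(String host) {
        failedAt.remove(host);
    }

    /** True if no failure is recorded or the cooldown has elapsed. */
    boolean shouldAttempt(String host, long nowMillis) {
        Long last = failedAt.get(host);
        if (last == null) {
            return true;
        }
        if (nowMillis - last >= cooldownMillis) {
            failedAt.remove(host, last); // entry expired, allow a new attempt
            return true;
        }
        return false; // failed too recently, skip this attempt
    }
}
```

Note that a check like this only makes each request cheap. If a plugin keeps asking for the connection in a tight loop, the loop itself still burns CPU, which is why the thread dump is needed to find the caller.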

Regards,

– Gato