So one of our test boxes went crazy today. Two boxes are linked, and we’re running shared rosters using groups and the auto-subscribe plugin. Both servers are whitelisted.
We took down a box to test a build, and the other went nuts. CPU and memory usage went through the roof, and all the error logs filled up with the following looping error (trimmed for readability):
2007.10.04 10:06:48 [org.jivesoftware.openfire.session.OutgoingServerSession.createOutgoingSession (OutgoingServerSession.java:258)] Error trying to connect to remote server: YYY.XXX.com(DNS lookup: YYY.XXX.com:5269)
java.net.ConnectException: Connection refused
2007.10.04 10:06:48 [org.jivesoftware.openfire.session.OutgoingServerSession.createOutgoingSession (OutgoingServerSession.java:258)] Error trying to connect to remote server: XXX.com(DNS lookup: XXX.com:5269)
java.net.ConnectException: Connection refused
This loop occurred 2252 times over a 44-second period. I’m sure there were more, but those were the contents of the six rotated Openfire error logs when we killed it.
I think this is an issue with Openfire not handling subdomains of the form ZZZ.YYY.XXX.com correctly (something we have previously encountered elsewhere).
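To illustrate what I suspect is happening: the log shows attempts against YYY.XXX.com and then XXX.com, which would be consistent with the server stripping one leading label at a time from ZZZ.YYY.XXX.com and trying each parent domain. This is only my reading of the log, not something I have confirmed in the Openfire source. A minimal sketch of that fallback, where the class name ParentDomains and the method candidates are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration, not Openfire code: lists the parent domains of a hostname. */
final class ParentDomains {

    /** Returns the parent domains of hostname, nearest first, stopping before a bare TLD. */
    static List<String> candidates(String hostname) {
        List<String> result = new ArrayList<>();
        int index = hostname.indexOf('.');
        while (index > -1 && index < hostname.length() - 1) {
            String parent = hostname.substring(index + 1);
            if (parent.indexOf('.') < 0) {
                break; // only a single label such as "com" is left; do not try it
            }
            result.add(parent);
            index = hostname.indexOf('.', index + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Mirrors the two hosts seen in the error log above.
        System.out.println(candidates("ZZZ.YYY.XXX.com")); // prints [YYY.XXX.com, XXX.com]
    }
}
```

If something like this runs on every failed delivery with no delay in between, each stanza destined for the downed box would produce exactly the pair of errors seen in the log.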
Either way, shouldn’t the connection attempt (and its DNS lookup) give up after a couple of tries and queue a retry at a reasonable time in the future?
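For what it's worth, this is roughly the behaviour I would have expected: remember that a remote host just refused the connection, and skip further attempts until a delay that doubles with each failure has passed. The sketch below is not Openfire code; the class RetryBackoff, its methods, the 1-second base delay, and the 5-minute cap are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Openfire: tracks failed hosts and when to retry them. */
final class RetryBackoff {
    private static final long BASE_DELAY_MS = 1_000L;   // assumed delay after the first failure
    private static final long MAX_DELAY_MS = 300_000L;  // assumed cap of 5 minutes

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> nextAttemptAt = new HashMap<>();

    /** Delay to wait after the given number of consecutive failures (1 = first failure). */
    static long delayMillis(int failureCount) {
        // Clamp the shift so the doubling can never overflow a long.
        int shift = Math.min(Math.max(failureCount - 1, 0), 20);
        return Math.min(BASE_DELAY_MS << shift, MAX_DELAY_MS);
    }

    /** True if the host has never failed or its back-off window has elapsed. */
    boolean mayAttempt(String host, long nowMs) {
        Long at = nextAttemptAt.get(host);
        return at == null || nowMs >= at;
    }

    /** Records a refused or failed connection and schedules the earliest next attempt. */
    void recordFailure(String host, long nowMs) {
        int count = failures.merge(host, 1, Integer::sum);
        nextAttemptAt.put(host, nowMs + delayMillis(count));
    }

    /** Clears the back-off state once a connection succeeds. */
    void recordSuccess(String host) {
        failures.remove(host);
        nextAttemptAt.remove(host);
    }

    public static void main(String[] args) {
        RetryBackoff backoff = new RetryBackoff();
        backoff.recordFailure("YYY.XXX.com", 0L);
        System.out.println(backoff.mayAttempt("YYY.XXX.com", 500L));   // prints false
        System.out.println(backoff.mayAttempt("YYY.XXX.com", 1_000L)); // prints true
    }
}
```

With something like this in front of the connect call, a box that is down would cost a handful of attempts over those 44 seconds rather than thousands.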
Is there a fundamental bug in how Openfire deals with hostnames that contain three or more periods? If so, is it being addressed? If not, what is causing this unexpected behavior?
Any input on this would be appreciated.