I have narrowed down our s2s issues. After the change in resolv.conf, the main s2s connections between two servers work fine. But there are big problems with s2s connections to components on other servers.
If I connect from myjabber.net or jivesoftware.com to firstname.lastname@example.org, I get two s2s connections to conference.jabber.org: one incoming and one outgoing. If there is no activity in the chat room, jabber.org closes its outgoing s2s connection, which is the incoming connection on the Wildfire server. Now we have a one-way connection, and it looks like the incoming connection from jabber.org never gets re-established. If I send messages to the chat room or change my presence, they never appear in my chat window.
The same problem appears when browsing other non-Wildfire servers. I send out IQ packets and never get back a result: no timeout and no response. I think I should at least get a timeout if there is no response.
So I think there is an s2s problem between ejabberd and Wildfire servers. And there must be a similar problem between jabberd 1.x and Wildfire.
I have some more details about this problem.
When the server is in this strange state and has no outgoing s2s connection (according to the web interface), and I try to send a packet to a server/component with a missing outgoing s2s connection, nothing happens. I see nothing in the logs, no errors and no info.
To me it looks like the server does nothing and just puts my packet on a stack.
It can take anywhere from several minutes to several hours to get this s2s connection again. Once it is available again, the server delivers all the packets from the stack.
Is there no timeout for packets in the server? I think the server should drop a packet once it has been unable to deliver it for, e.g., 60 seconds and respond with a timeout error.
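To make the timeout idea concrete, here is a minimal sketch in Java of an outgoing queue that expires packets after a fixed time. This is not how Wildfire actually works internally; the class name `ExpiringOutboundQueue` and its methods are made up for illustration, and packets are represented as plain strings. The caller would bounce each expired packet back to the sender with an error (XMPP defines a `remote-server-timeout` stanza error that would fit here).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not Wildfire code: packets waiting for an s2s
// connection are kept with their enqueue time and expire after a timeout.
public class ExpiringOutboundQueue {

    private static final class Entry {
        final String packet;
        final long enqueuedAtMillis;

        Entry(String packet, long enqueuedAtMillis) {
            this.packet = packet;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    // Oldest entry is always at the head because we only append at the tail.
    private final Deque<Entry> pending = new ArrayDeque<>();
    private final long timeoutMillis;

    public ExpiringOutboundQueue(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Queue a packet that cannot be delivered yet.
    public synchronized void enqueue(String packet, long nowMillis) {
        pending.addLast(new Entry(packet, nowMillis));
    }

    // Remove and return every packet that has waited at least timeoutMillis.
    // The caller is expected to answer each one with an error stanza.
    public synchronized List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        while (!pending.isEmpty()
                && nowMillis - pending.peekFirst().enqueuedAtMillis >= timeoutMillis) {
            expired.add(pending.removeFirst().packet);
        }
        return expired;
    }

    public synchronized int size() {
        return pending.size();
    }
}
```

With a 60 second timeout, a periodic task would call `expire(System.currentTimeMillis())` and send an error for each returned packet, so the client gets a response instead of waiting for hours.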
I ran some more tests.
If I enable a whitelist for s2s and allow only some servers to connect, there are no problems at all. If I disable the whitelist and allow all servers, we have the same problems.
It's very strange that I see absolutely no errors or connection attempts in the info log.
Could somebody from Jive Software please comment on this problem and let us know how the s2s connections are handled on the server?
Is it possible that it queues all connection attempts and this queue gets bigger and bigger in our case?
Or does it handle all s2s connection attempts in multiple threads simultaneously?
Is it possible that there are problems with several hundred simultaneous s2s connections on a single Wildfire server?
After I blocked some domains which showed many errors and failed s2s connections in the Debug log, the problems seem to be fixed.
But it would be better to find the root of the problem. Failing s2s connections should not cause failures to other s2s connections or block them from connecting. Any ideas how this could happen?
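To illustrate what I would expect, here is a rough Java sketch where each connection attempt runs in its own thread with an overall time limit, so a domain that hangs or fails cannot hold up the others. I do not know whether Wildfire does anything like this; `IsolatedConnector` and `connectAll` are invented names, and each attempt is just a `Callable<Boolean>` standing in for a real s2s handshake.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Wildfire code: run every s2s connection attempt
// in parallel and give up on attempts that do not finish in time.
public class IsolatedConnector {

    // Returns, per domain, whether the attempt succeeded within the time limit.
    public static Map<String, Boolean> connectAll(
            Map<String, Callable<Boolean>> attempts, long timeoutMillis) {
        List<String> domains = new ArrayList<>(attempts.keySet());
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (String domain : domains) {
            tasks.add(attempts.get(domain));
        }

        // One thread per attempt, so a hanging attempt never blocks another.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            // invokeAll cancels every task that is still running at the deadline
            // and returns the futures in the same order as the task list.
            List<Future<Boolean>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);

            Map<String, Boolean> results = new LinkedHashMap<>();
            for (int i = 0; i < domains.size(); i++) {
                Future<Boolean> future = futures.get(i);
                boolean ok;
                if (future.isCancelled()) {
                    ok = false; // timed out
                } else {
                    try {
                        ok = Boolean.TRUE.equals(future.get());
                    } catch (ExecutionException e) {
                        ok = false; // the attempt itself threw an error
                    }
                }
                results.put(domains.get(i), ok);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while connecting", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If the server instead processed attempts one after another in a single queue, a few unreachable domains with long timeouts could delay everything behind them, which would match what I am seeing.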