Retries for failed remote servers prevent communication with working ones

Hi,

One of the remote servers I talk to (jabber.zim.net.au) became unreachable, and my JM installation then stopped connecting to other servers that are known to be working.

With debug logging turned on, I can see that JM continually attempts to connect to the failed server for each of my contacts and takes a very long time to get to the working servers, which effectively prevents any communication at all.

Can someone confirm this problem, or suggest a workaround?

Could this maybe be addressed by implementing some fair queuing scheme so that one broken server doesn’t prevent all other connections?

Hey adamk,

Are you still using JM 2.2.2? For JM 2.3.0 we implemented JM-416, which improves the server in exactly the way you are proposing. Let me know if the problem has gone away with the latest release.

Thanks,

– Gato

Unfortunately, I’ve been trying it with 2.3.0 in the hope that that particular change would fix it for me. It hasn’t, though.

Hey adamk,

So what you are seeing is that only one s2s connection to a remote server is being attempted at a time, is that correct? Out of the box, Jive Messenger has 20 threads that should be trying to establish connections to remote servers in parallel. Note that if many packets are being sent to the same server and a new connection is needed, then many threads are going to be consumed but blocked while only 1 is actually trying to connect. Maybe this is what is happening in your installation?

To confirm that all threads are being consumed and all of them are trying to establish a connection to the same server, you can get a thread dump of the JVM (Java virtual machine), enable the debug log to watch the s2s handshake activity and post the packets that clients sent to the remote server.
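On Unix, sending SIGQUIT to the Java process (kill -3 with its pid) makes the JVM print a thread dump to standard output; on Windows, Ctrl+Break in the console window does the same. If it is easier to collect the information from code running inside the same JVM, something along these lines should work on Java 5 or later. This is only a sketch: the class and method names are made up for illustration and are not part of Jive Messenger.

```java
import java.util.Map;

public class ThreadDumpSketch {
    /** Formats the stack traces of all live threads whose name starts with namePrefix. */
    static String dump(String namePrefix) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().startsWith(namePrefix)) {
                continue; // not one of the threads we are interested in
            }
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An empty prefix matches every thread; pass e.g. "pool-" to narrow the output.
        System.out.print(dump(""));
    }
}
```

The thread state (BLOCKED, WAITING, RUNNABLE) and the top few frames are usually enough to tell whether the threads are all stuck behind the same connection attempt.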

If this is your case, then a quick solution would be to set the system property xmpp.server.outgoing.threads to a higher value. And I will try to get an improvement in that does not consume a thread if a thread is already trying to establish a connection to the same remote server.
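To make the idea concrete, here is a rough sketch of how such an improvement might look. It is not the Jive Messenger code: PerDomainConnector, Connector and send are hypothetical names, and packets are plain strings to keep it short. The point is only that a packet for a domain that already has a connection attempt in progress gets parked instead of occupying another thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executor;

public class PerDomainConnector {
    /** Hypothetical stand-in for the real s2s handshake; returns true on success. */
    public interface Connector {
        boolean connect(String domain) throws Exception;
    }

    private final Executor pool;
    private final Connector connector;
    // Packets parked behind a connection attempt that is already running, keyed by domain.
    private final Map<String, List<String>> waiting = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();

    public PerDomainConnector(Executor pool, Connector connector) {
        this.pool = pool;
        this.connector = connector;
    }

    /** Returns true if this call started a new connection attempt (and so took a pool slot). */
    public synchronized boolean send(String domain, String packet) {
        List<String> queue = waiting.get(domain);
        if (queue != null) {
            queue.add(packet); // attempt already in progress: park the packet, use no thread
            return false;
        }
        queue = new ArrayList<>();
        queue.add(packet);
        waiting.put(domain, queue);
        pool.execute(() -> attempt(domain));
        return true;
    }

    private void attempt(String domain) {
        boolean ok;
        try {
            ok = connector.connect(domain);
        } catch (Exception e) {
            ok = false;
        }
        synchronized (this) {
            List<String> queue = waiting.remove(domain);
            if (ok) {
                delivered.addAll(queue);
            }
            // On failure the parked packets are simply dropped in this sketch;
            // a real server would presumably bounce them back to the senders.
        }
    }

    public synchronized List<String> delivered() {
        return new ArrayList<>(delivered);
    }
}
```

With this shape, passing a thread pool as the Executor means a dead server costs at most one pool thread at a time, no matter how many packets are addressed to it, so the remaining threads stay free for servers that are reachable.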

Thanks,

– Gato

Well, I use an MSN transport that is on the failed remote server. What I’m seeing in the debug log is that JM is spending all its time attempting to query the presence of my entire contact list through that server, and continually timing out. (For each of my MSN contacts, it tries msn.jabber.zim.net.au, then jabber.zim.net.au, then zim.net.au, then net.au, then au… all of which obviously fail.)

During this, I can’t seem to make connections to any other remote servers.

The thread dump shows only 5 threads alive and blocked from “pool-2-thread-?”… those are the outgoing server threads, right? One looks like it’s blocked on a TCP connection, the others are blocked at “org.jivesoftware.messenger.server.OutgoingServerSession.authenticateDomain(OutgoingServerSession.java:136)”.

I set xmpp.server.outgoing.threads to 20, but there still appear to be only 5 of these “pool-2-thread-?” threads.

Any clues there?
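
One thing that might be worth ruling out, and this is a guess rather than something checked against the JM source: the “pool-N-thread-M” names are what the default thread factory in java.util.concurrent produces, so the pool may be a ThreadPoolExecutor. If it is, and if it is built with a small core size and an unbounded work queue, it never grows past the core size regardless of the configured maximum, because extra threads are only created when the queue refuses a task. The sketch below demonstrates that behaviour in isolation; the numbers 5 and 20 are only borrowed from this thread, and the class name is made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthSketch {
    /** Submits blocking tasks and reports how many threads the pool actually created. */
    static int poolSizeUnderLoad(int core, int max, int tasks) {
        final CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>()); // unbounded queue: offers never fail
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // simulate a thread stuck in a connection attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int size = pool.getPoolSize();
        release.countDown(); // let the blocked tasks finish so the pool can shut down
        pool.shutdown();
        return size;
    }

    public static void main(String[] args) {
        System.out.println("threads created: " + poolSizeUnderLoad(5, 20, 50));
    }
}
```

If that turns out to be the situation, raising only the maximum would have no visible effect, which would match what the thread dump shows.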