I’ve been debugging this problem off and on for several weeks while trying to deploy a new server on our internal network. Same symptoms: we’re using Active Directory and logins fail intermittently. I watched packets with Network Monitor and verified that when a login fails, the Windows (LDAP) server never receives a request. So the failure comes from some low-level network connectivity issue or a bug in how the Openfire server establishes the request, not from a true authentication failure on the server or a problem processing the reply.
I’ve tried several things that I thought might fix the problem, but each time it came back. It’s hard to verify that an intermittent problem is actually fixed, since it sometimes seems to work fine for days and then fails again.
Today I hit on a new theory that I’m just testing now. Figured I’d post it just in case “this is it” and it works for someone else.
I realized that our WS2003 system actually has a name that resolves to two separate IP addresses: 192.168.50.2 (the right one) and 192.168.50.25 (the one that’s allocated by RRAS for VPN connections). DNS returns both addresses, and the order flip-flops from query to query.
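If you want to check whether your own server name behaves this way, here is a rough Python sketch (not anything Openfire itself runs) that resolves a hostname repeatedly and counts which address comes back first. The function names are my own, and the hostname in the usage note is a placeholder. Local resolver caching may hide the flip-flopping, so treat a steady result as inconclusive.

```python
import socket


def resolve_all(hostname, port=389, resolver=socket.getaddrinfo):
    """Return the unique IPv4 addresses reported for hostname, in the order received."""
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    seen = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def first_answer_counts(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Resolve hostname several times and count how often each IP is listed first."""
    counts = {}
    for _ in range(attempts):
        ips = resolve_all(hostname, resolver=resolver)
        if ips:
            counts[ips[0]] = counts.get(ips[0], 0) + 1
    return counts
```

Calling `first_answer_counts("dc.example.internal")` (a made-up name, substitute your own server's FQDN) from the Openfire machine should show more than one key in the result if the name resolves to multiple addresses and the order rotates.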
Port 389 (LDAP) seems to listen on both addresses, but I have a gateway/firewall sitting between the Openfire server and the Windows server. It has a hole punched for port 389 on the primary static IP, and I never opened that port to the other IP address used by RRAS. That wouldn’t even be practical, since RRAS allocates that IP address from DHCP and it might change from time to time. Any attempt to connect to LDAP at the RRAS address would fail. So my theory is that Openfire does a DNS lookup for the hostname, sometimes gets the RRAS IP address first, tries to connect to a port the firewall blocks, and fails.
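One way to test this theory from the Openfire machine is to try a plain TCP connection to port 389 on each address the name resolves to. The Python sketch below does only that. It does not speak LDAP, and the function names are mine, so it shows whether the firewall lets the connection through, not whether authentication would succeed.

```python
import socket


def can_connect(ip, port=389, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_addresses(ips, port=389, timeout=3.0, probe=can_connect):
    """Map each IP to whether the TCP probe reached it on the given port."""
    return {ip: probe(ip, port, timeout) for ip in ips}
```

On my network I would expect `check_addresses(["192.168.50.2", "192.168.50.25"])` to report the primary address as reachable and the RRAS address as not, if the theory is right. A blocked port often shows up as a timeout rather than an immediate refusal, so the check can take a few seconds per address.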
The solution (if this is actually the issue here) is to change the ldap.host parameter to the primary IP address of the Windows server rather than using the FQDN and thus bypass the round-robin DNS.
I just implemented this and it’s working fine right now. I’m not 100% certain this fixes it, however, due to the intermittent nature of this error. I’d love feedback from you guys if you find that your networks have similar configurations and this is a possible explanation for your problems.