Wildfire with ActiveDirectory LDAP hangs the wildfire server

First off, thanks to everyone who’'s answering problems here, in less than 4 hours I was able to get wildfire configured to use active directory by looking at answers to other people questions.

Now I have a system where its a very large active directory with lots and lots of users.

I do a tail -f debug.log | grep sAMAccountName to watch my login.

Right after my authentication is done and hits the logs, tons of other authentications start to pour into the logs. This is after I have removed the enableDebug from within the tag.

The wildfire server seems to get so overloaded that /etc/init.d/wildfire stop is not able to stop it. I have to reboot. This was true both with and without the ldap.enableDebug tag. Here are my current ldap/debugging/database setttings.

/code

Thanks,

Joel Schaubert

Message was edited by: bikeracer4700

Message was edited by: bikeracer4700

Message was edited by: bikeracer4700

Hi Joel,

would you please edit your post and put the file contents between and ?

LG

Done.

Bummer that there’'s no easy way to get the angle brackets and other parts of the xml to come through. Neither pre nor code tags did the trick.

Joel

joel,

I think, you have not done it right yet.

lots and lots of code /code

It should be like that.

regards,

wmhtet

Well, for me it looks like this using the code tag. I hope that bold[/b], italic[/i] or underline[/u] (left of spell check) are working for you?

/code

Oh now that’'s just sweet lookin.

I was trying to use html style angle backets around the word code.

Using squares is much better, thanks.

If your config is what you used to start, may I say that your adminDN might contribute to the problem. You are entering the user name, not the distinguished name (which should be of same format[/i] of your baseDN).

I wasn’'t able to get the adminDN to work that way. I tried several combinations based on

CN=Schaubert, Joel;OU=OU_MYCOUsers;DC=myco;DC=com

After a few hours of adding more Log.debug messages to pin down what is happening, here is what I’'ve learned.

The code call is coming in through

net/SASLAuthentication.java doPlainAuthentication()

auth/AuthFactory.java authenticate()

ldap/LdapAuthProvider.java authenticate()

Now here is where it gets strange.

I can track the code via debug messages as running once through the steps in LdapAuthProvider.authenticate(). So at the end of this routine I added a log message saying its done.

I can see this message in the logs, AND my client gets authenticated and hooks up.

BUT THEN, after a short pause (about 2-3 seconds) the log file takes off again and it seems that LdapManager.findUserDN(username); starts going mad and just checking out one user after another walking through the entire LDAP directory.

Now the code in LdapAuthProvider.authenticate() says something about needing to search all the subtree’'s of the baseDN for the username.

try {

// The username by itself won’'t help us much with LDAP since we

// need a fully qualified dn. We could make the assumption that

// the baseDN would always be the location of user profiles. For

// example if the baseDN was set to “ou=People, o=jivesoftare, o=com”

// then we would be able to directly load users from that node

// of the LDAP tree. However, it’'s a poor assumption that only a

// flat structure will be used. Therefore, we search all subtrees

// of the baseDN for the username. So, if the baseDN is set to

// “o=jivesoftware, o=com” then a search will include the “People”

// node as well all the others under the base.

Log.debug("LdapAuthProvider getting userDN "+userDN);

userDN = manager.findUserDN(username);

Log.debug("LdapAuthProvider authenticating userDN "+userDN);

// See if the user authenticates.

if (!manager.checkAuthentication(userDN, password)) {

throw new UnauthorizedException(“Username and password don’'t match”);

}

}

catch (Exception e) {

throw new UnauthorizedException(e);

}

// If cache is enabled, add the item to cache.

if (authCache != null) {

authCache.put(username, StringUtils.hash(password));

}

Log.debug("LdapAuthProvider DONE authenticating userDN "+userDN);

}

/code

But what I can’'t understand is the following

  1. why do all these additional user lookups happen AFTER my user is already done being authenticated

  2. why is LdapManager.findUserDN(username) getting called for all these other users in the directory

  3. where are these calls being fired from? I don’'t see any loops in LdapAuthProvider.authenticate() that look like they could be the causes of the extra calls. Could there be possible threading problems with one of the ldap interface libraries?

Here are the logs that come out, you will see my user being authenticated. Then things go crazy and we start walking all users

2006.03.14 10:29:00 Connect Socket[addr=/10.23.80.143,port=45351,localport=5222] 2006.03.14 10:29:00 SASL calling AuthFactory authenticate

2006.03.14 10:29:00 AuthFactory 1 calling authenticate

2006.03.14 10:29:00 LdapAuthProvider getting userDN null

2006.03.14 10:29:00 Trying now, to find a user’'s DN based on their username. sAMAccountName: jschaube, Base DN: OU=OU_mycoUsers;DC=corp;DC=myco;DC=com…2006.03.14 10:29:00 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:00 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:00 … context created successfully, returning.

2006.03.14 10:29:00 Starting LDAP search…

2006.03.14 10:29:01 … search finished 2006.03.14 10:29:01 LdapAuthProvider authenticating userDN CN=Schaubert, Joel,OU=OU_2014ACE,OU=OU_0102Technology,OU=OU_0001myco

2006.03.14 10:29:01 In LdapManager.checkAuthentication(userDN, password), userDN is: CN=Schaubert, Joel,OU=OU_2014ACE,OU=OU_0102Technology,OU=OU_0001myco…

2006.03.14 10:29:01 Created context values, attempting to create context…

2006.03.14 10:29:01 … context created successfully, returning.

2006.03.14 10:29:01 LdapAuthProvider DONE authenticating userDN CN=Schaubert, Joel,OU=OU_2014ACE,OU=OU_0102Technology,OU=OU_0001myco

/code

Ok, now we have my user authenticated. This is where it gets weird. That DONE message above is one I added at the end of the LdapAuthProvider.authenticate() routine. So how/why would there still be calls to the routines in LdapManager running around and doing more work?

2006.03.14 10:29:01 Created context values, attempting to create context…

2006.03.14 10:29:01 … context created successfully, returning.

2006.03.14 10:29:01 LdapAuthProvider DONE authenticating userDN CN=Schaubert, Joel,OU=OU_2014ACE,OU=OU_0102Technology,OU=OU_0001ETradeGroup

2006.03.14 10:29:01 Trying now, to find a user’'s DN based on their username. sAMAccountName: jschaube, Base DN: OU=OU_myco;DC=corp;DC=myco;DC=com…

2006.03.14 10:29:01 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:01 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:01 … context created successfully, returning.

2006.03.14 10:29:01 Starting LDAP search…

2006.03.14 10:29:01 … search finished

2006.03.14 10:29:01 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:01 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:01 … context created successfully, returning.

2006.03.14 10:29:01 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:01 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:01 … context created successfully, returning.

/code

Now a 2-3 second pause before the logs start cranking out stuff again, then they never stop and wildfire cannot be shutdown

2006.03.14 10:29:07 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:07 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:07 … context created successfully, returning.

2006.03.14 10:29:07 Trying now, to find a user’'s DN based on their username. sAMAccountName: smountie, Base DN: OU=OU_myco;DC=corp;DC=myco;DC=com…

2006.03.14 10:29:07 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:07 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:07 … context created successfully, returning.

2006.03.14 10:29:07 Starting LDAP search…

2006.03.14 10:29:07 … search finished

2006.03.14 10:29:07 Creating a DirContext in LdapManager.getContext()…

2006.03.14 10:29:07 Created hashtable with context values, attempting to create context…

2006.03.14 10:29:07 … context created successfully, returning.

2006.03.14 10:29:08 Trying now, to find a user’'s DN based on their username. sAMAccountName: chenders, Base DN: OU=OU_myco;DC=corp;DC=myco;DC=com…

/code

So I just show enough here to show the loop that it’'s in, it repeats with a different user name each time until I use kill -9 to kill all the wildfire processes.

Anyone with any experience in the LDAP code of wildfire, I would be glad to have even speculation on how so many calls get turned loose as I could add more logging based on any decent theories.

thanks

Joel Schaubert

Hey Joel,

Sorry to hear that you are having such hard time.

First question, are you using shared groups? Do you have a public shared group (i.e. a group that will appear in everyone’'s roster)?

If the answer is yes then I highly recommend doing an upgrade to Wildfire 2.5.1 if you are using an oldest version. Some optimizations were done regarding shared groups and LDAP. Another tip that can be useful is to use a filter to reduce the number of users to integrate with Wildfire (if possible).

Does this problem happen every time a user logs in or when the first user logs in the server?

BTW, you may want to take a couple of thread dumps[/url] to find out what the JVM/server is doing?

Regards,

– Gato

I think you may be onto something here.

I’'ve added more debug and now I think that is correct behaviour (not a bug) that just takes a super long time and way too much CPU to finish its job.

2006.03.15 11:46:42 LdapGroupManager.getGroups() calling popluate groups using search filter of (member=*)

2006.03.15 11:46:42 Trying to find all groups in the system.

2006.03.15 11:46:42 Starting LDAP search…

2006.03.15 11:46:42 Using groupSearchFilter: (member=*)

2006.03.15 11:46:48 Starting to populate groups with users.

/code

What I’'m going to do is just let this thing run and see how long it takes.

My guess is that it will be hours long, but it will give me a good baseline.

Then I’'ll take a look at trying to move from 2.5.0 to the newer 2.5.1 and see how much faster it runs as a comparision.

Doesn’'t look like there is a gentoo ebuild for 2.5.1 yet but I can probably write one myself

http://packages.gentoo.org/search/?sstring=wildfire

Joel Schaubert

Hey Joel,

Sounds like a good plan to me. And regarding the gentoo ebuild for Wildfire…awesome.

Thanks,

– Gato

I have a baseline now, and a question about strange behaviour.

After I used the gaim client to login, wildfire server took

21minutes 6 seconds to load 2284 ldap users /code

doing its call to LdapGroupManager.getGroups().

Then when I came back in this morning and started working on the machine running the client, the gaim client detects the typing and changes my status from offline back to online.

At wildfire, this seems to kick off another loaduser AND another loadgroups so another 21 minutes of building up user tables. This seems like a bad behaviour for wildfire to have if an ldap directory is large, it this a bug?

2006.03.15 12:38:20 Creating a DirContext in LdapManager.getContext()…

2006.03.15 12:38:20 Created hashtable with context values, attempting to create context… 2006.03.15 12:38:20 … context created successfully, returning.

2006.03.16 05:42:52 LdapUserProvider.loadUser calling findUserDN jschaube

2006.03.16 05:42:52 Trying now, to find a user’'s DN based on their username. sAMAccountName: jschaube, Base DN: OU=OU_mycoUsers;DC=corp;DC=mycogrp;DC=com…2006.03.16 05:42:52 Creating a DirContext in LdapManager.getContext()…

2006.03.16 05:42:52 Created hashtable with context values, attempting to create context…

2006.03.16 05:42:57 … context created successfully, returning.

2006.03.16 05:42:57 Starting LDAP search… 2006.03.16 05:42:57 … search finished

2006.03.16 05:42:57 Creating a DirContext in LdapManager.getContext()…

2006.03.16 05:42:57 Created hashtable with context values, attempting to create context… 2006.03.16 05:42:58 … context created successfully, returning.

2006.03.16 05:42:58 GroupManager.Collection getGroups()

2006.03.16 05:42:58 LdapGroupManager.getGroups() calling popluate groups using search filter of (member=*)

2006.03.16 05:42:58 Trying to find all groups in the system.

/code

Seems that’'s also the problem when I update group sharing policy of a large group.

Upgraded to wildfire 2.5.1

First user login

21minutes 20seconds to lookup 2282 ldap users

/code

Second user login, no additional lookups

First user requests to be on second users buddy list, and the server again goes off and request all ldap users and hangs again in another query for all groups.

When this query for all groups finally finishes, then the clients pop up windows to confirm the request to add to buddy list.

This is looking pretty bad for usability. It seems that the wildfire server is walking the entire ldap directory too often and too slowly to make is usable with large (size 2000) directories. I’'ll leave the client up again all night to see if when I come back the next day if simply waking up again causes the server to requery every ldap user again.

Hey joel,

Thanks for the info. I’'m now working on this problem trying to reproduce it and fix it. Do you have an XMPP account so we can talk more fluently? My XMPP address can be found in my profile or next to my name in this post.

Thanks,

– Gato

No unfortunately I don’'t have any external xmpp access setup anywhere yet.

Until I can figure out where/how to do that I’'ll just keep a close watch on the forum.

Hey Joel,

I’'m almost done with some important performance improvements. The improvements will affect mainly when users log in or request their rosters. I have a couple of questions for you:

  1. Is your posixMode value false? More LDAP queries are done when posixMode is false. I’'m still analyzing if we can avoid them.

  2. How many shared contacts do you have per roster? And how many contacts (local + shared) do you have per roster?

There is an expensive operation that I cannot avoid that is getting the user display name . So if you have many shared contacts the server cannot avoid querying the LDAP server to get such info. The information is cached but still the server will have to make those queries at least once or every time the caches expires.

I may check in these improvements so you can use the nightly build version and let me know if you see any improvements. Though if you have many contacts in your roster then many LDAP queries are going to be made.

Regards,

– Gato

There is an expensive operation that I cannot avoid

that is getting the user display name . So if you

have many shared contacts the server cannot avoid

querying the LDAP server to get such info. The

information is cached but still the server will have

to make those queries at least once or every time the

caches expires.

Something to keep in mind is LDAP servers are used to this kind of “abuse” (or at least, they should be). As an example, we have several hundred Linux hosts using LDAP as the user backend. We ran into problems cacheing, so we dont cache anything on the clients. All these clients perform ldap queries every time a usrename/uid needs to be looked up on any of these hosts. It gets quite busy as you can imagine. We have multiple LDAP servers, but all the clients point to a single LDAP server first, and we generally dont see any problems (it is not a dedicated LDAP server either).

So, performing a bunch of queries once and a while (when cache expires) should not slow things down too much. It would be interesting to get some statistics about how long queries are taking, and compare that to other ldap environments. While its important to streamline the queries being generated, the server should get some optimizations too (make sure you have proper indexes for common queries, replicate to a less busy server, for some examples)

I will glady keep testing any changes you have. Either patch files or new releases will be fine.

  1. My posixMode will be whatever the default is since I have no tag like that.

  2. Unfortunately I do not know LDAP well enough to answer these questions. If you could tell me where I could do some additional reading then I could try and answer. I am using active directory and there are around 2000 users and perhaps around 200 groups. I don’'t yet know what shared contacts nor rosters are.

The main thing I see is that wildfire does not trust its cache very much.

The two cases are when someone tries to add a new buddy, and secondly after a really long idle time (overnight) when someone reactiviates their client by touching a key or mouse. These cases seem to drive right past the cache and try to make raw ldap queries again.

I’'ll keep checking for how you want to handle and test any performance improvements.