New Wildfire implementation showing Java memory errors

What type of trace do you want to see? A packet-level trace (Wireshark) or something out of the stack trace tool?

I want to see if there's anything in the thread dump that points to an LDAP connection left open. That will tell us if it's a problem in Java-land or something in the OS or firewall.

One interesting part of this, and one I'm looking into more, is that not only does it keep active TCP/IP LDAP connections, but those then sit in CLOSE_WAIT state for a very long time.

We are starting to wonder if the fact that we have a firewall between the Wildfire server and the LDAP/AD server is in play.

Interesting. That could definitely be an issue. Is it possibly also an operating system problem?

Also: one person suggested we actually compile Wildfire locally with the JDK. Does that suggestion make sense to you?

It doesn't make any sense to me. In fact, I'm not sure what they are talking about.

-Matt

Hi Matt,

Did you already take a look at the stack traces sent to Gato?

LG

PS: I agree that compiling Wildfire locally does not make any sense, except to keep one busy.

OK, I'll get a thread dump from Java and mail it on.

I have found something interesting in a packet dump.

I see Wildfire using an existing (probably pooled) LDAP connection to search for the user ID.

Then it starts a separate BIND request for authentication.

This process runs normally, and the LDAP server responds.

Wildfire then ACKs the BIND response, but does not send a FIN/close.

The connection then stays open for 13-18 minutes (the idle timeout on the AD servers) until the AD server sends a FIN back. So it's almost as if Wildfire is not closing the authentication BIND connection until the AD server shuts it down.

Note: the packet trace was done from the Wildfire server, so it's not the firewall dropping the FIN.

Packet trace (for those who like this kind of stuff as much as I do):

Packet 49 is the ACK of the BIND response at 10:09:59.144009.

Then the connection sits idle until a FIN ACK is sent from the AD server at 10:24:56.889828 (packet 679; packets 50-678 are other LDAP connections). This same pattern happens with every authentication request made.

No. Time Source Destination Protocol Info

44 10:09:59.032477 172.21.232.107 172.21.35.87 TCP 4459 > ldap Seq=0 Len=0 MSS=1460

Frame 44 (62 bytes on wire, 62 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 0, Len: 0

No. Time Source Destination Protocol Info

45 10:09:59.033434 172.21.35.87 172.21.232.107 TCP ldap > 4459 Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

Frame 45 (62 bytes on wire, 62 bytes captured)

Ethernet II, Src: Netscree_55:6c:07 (00:10:db:55:6c:07), Dst: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0)

Internet Protocol, Src: 172.21.35.87 (172.21.35.87), Dst: 172.21.232.107 (172.21.232.107)

Transmission Control Protocol, Src Port: ldap (389), Dst Port: 4459 (4459), Seq: 0, Ack: 1, Len: 0

No. Time Source Destination Protocol Info

46 10:09:59.033468 172.21.232.107 172.21.35.87 TCP 4459 > ldap Seq=1 Ack=1 Win=65535 Len=0

Frame 46 (54 bytes on wire, 54 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 1, Ack: 1, Len: 0

No. Time Source Destination Protocol Info

47 10:09:59.033914 172.21.232.107 172.21.35.87 LDAP bindRequest(1) simple

Frame 47 (147 bytes on wire, 147 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 1, Ack: 1, Len: 93

Lightweight-Directory-Access-Protocol

No. Time Source Destination Protocol Info

48 10:09:59.037570 172.21.35.87 172.21.232.107 LDAP bindResponse(1)

Frame 48 (76 bytes on wire, 76 bytes captured)

Ethernet II, Src: Netscree_55:6c:07 (00:10:db:55:6c:07), Dst: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0)

Internet Protocol, Src: 172.21.35.87 (172.21.35.87), Dst: 172.21.232.107 (172.21.232.107)

Transmission Control Protocol, Src Port: ldap (389), Dst Port: 4459 (4459), Seq: 1, Ack: 94, Len: 22

Lightweight-Directory-Access-Protocol

No. Time Source Destination Protocol Info

49 10:09:59.144009 172.21.232.107 172.21.35.87 TCP 4459 > ldap Seq=94 Ack=23 Win=65513 Len=0

Frame 49 (54 bytes on wire, 54 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 94, Ack: 23, Len: 0

No. Time Source Destination Protocol Info

679 10:24:56.889828 172.21.35.87 172.21.232.107 TCP ldap > 4459 Seq=23 Ack=94 Win=65442 Len=0

Frame 679 (60 bytes on wire, 60 bytes captured)

Ethernet II, Src: Netscree_55:6c:07 (00:10:db:55:6c:07), Dst: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0)

Internet Protocol, Src: 172.21.35.87 (172.21.35.87), Dst: 172.21.232.107 (172.21.232.107)

Transmission Control Protocol, Src Port: ldap (389), Dst Port: 4459 (4459), Seq: 23, Ack: 94, Len: 0

No. Time Source Destination Protocol Info

680 10:24:56.889859 172.21.232.107 172.21.35.87 TCP 4459 > ldap Seq=94 Ack=24 Win=65513 Len=0

Frame 680 (54 bytes on wire, 54 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 94, Ack: 24, Len: 0

No. Time Source Destination Protocol Info

681 10:24:56.889940 172.21.232.107 172.21.35.87 LDAP unbindRequest(2)

Frame 681 (61 bytes on wire, 61 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 94, Ack: 24, Len: 7

Lightweight-Directory-Access-Protocol

No. Time Source Destination Protocol Info

682 10:24:56.889970 172.21.232.107 172.21.35.87 TCP 4459 > ldap Seq=101 Ack=24 Win=65513 Len=0

Frame 682 (54 bytes on wire, 54 bytes captured)

Ethernet II, Src: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0), Dst: Netscree_55:6c:07 (00:10:db:55:6c:07)

Internet Protocol, Src: 172.21.232.107 (172.21.232.107), Dst: 172.21.35.87 (172.21.35.87)

Transmission Control Protocol, Src Port: 4459 (4459), Dst Port: ldap (389), Seq: 101, Ack: 24, Len: 0

No. Time Source Destination Protocol Info

683 10:24:56.890657 172.21.35.87 172.21.232.107 TCP ldap > 4459 Seq=24 Ack=101 Win=0 Len=0

Frame 683 (60 bytes on wire, 60 bytes captured)

Ethernet II, Src: Netscree_55:6c:07 (00:10:db:55:6c:07), Dst: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0)

Internet Protocol, Src: 172.21.35.87 (172.21.35.87), Dst: 172.21.232.107 (172.21.232.107)

Transmission Control Protocol, Src Port: ldap (389), Dst Port: 4459 (4459), Seq: 24, Ack: 101, Len: 0

No. Time Source Destination Protocol Info

684 10:24:56.890705 172.21.35.87 172.21.232.107 TCP ldap > 4459 Seq=24 Len=0

Frame 684 (60 bytes on wire, 60 bytes captured)

Ethernet II, Src: Netscree_55:6c:07 (00:10:db:55:6c:07), Dst: HewlettP_0b:9f:a0 (00:15:60:0b:9f:a0)

Internet Protocol, Src: 172.21.35.87 (172.21.35.87), Dst: 172.21.232.107 (172.21.232.107)

Transmission Control Protocol, Src Port: ldap (389), Dst Port: 4459 (4459), Seq: 24, Len: 0

Pat

I haven't been able to find any bug reports that match this behavior yet. However, here is one interesting one:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4823077

Check out the summary. It mentions that any NamingEnumerations should be closed explicitly. Authentication operations don't use NamingEnumerations, but several other LDAP operations do, without calling close. Are you positive that it's the auth connection that's not closing, and not one of the other LDAP operations?
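If it is one of those search operations, the fix on the Wildfire side would be to close every NamingEnumeration explicitly. A minimal sketch of that pattern (the helper name is my own, not something from the Wildfire code):

```java
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;

public class LdapCleanup {
    // Close a NamingEnumeration without letting a close() failure
    // mask the result of the search that produced it.
    static void closeQuietly(NamingEnumeration<?> results) {
        if (results != null) {
            try {
                results.close();
            } catch (NamingException e) {
                // In Wildfire this would go to Log.error(e); here we just swallow it.
            }
        }
    }
}
```

Calling this in a finally block after every search would rule out the leak the Sun bug report describes.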

Regards,

Matt

This bug report also looks interesting:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6176036

What I saw in the thread dump is threads like the following:

"Thread-3157" daemon prio=6 tid=0x29d05d38 nid=0x12b8 runnable

at java.net.SocketInputStream.socketRead0(Native Method)

at java.net.SocketInputStream.read(Unknown Source)

at java.io.BufferedInputStream.fill(Unknown Source)

at java.io.BufferedInputStream.read1(Unknown Source)

at java.io.BufferedInputStream.read(Unknown Source)

- locked <0x25568600> (a java.io.BufferedInputStream)

at com.sun.jndi.ldap.Connection.run(Unknown Source)

at java.lang.Thread.run(Unknown Source)

That's certainly a thread sitting there waiting to read something. I'm just not sure why…

One thing we could try is creating a simple test program that only executes an auth and then nothing else. That would be a good way to isolate the issue. Maybe AD is doing something funky such that Java doesn't know it can fully shut down the connection?
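A sketch of such a test program, assuming the server URL, user DN, and password are passed on the command line (everything here is hypothetical, not Wildfire code):

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class AuthProbe {
    // Build the JNDI environment for a single simple bind; pooling is left at its default.
    static Hashtable<String, String> authEnv(String url, String userDn, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, userDn);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return env;
    }

    public static void main(String[] args) throws NamingException {
        // args: <ldap-url> <user-dn> <password> -- do one bind, close, exit.
        DirContext ctx = new InitialDirContext(authEnv(args[0], args[1], args[2])); // performs the BIND
        try {
            System.out.println("bind ok");
        } finally {
            ctx.close(); // should trigger an unbind and FIN right here
        }
    }
}
```

Running this once per login attempt while watching netstat or Wireshark would show whether a bare JNDI bind also leaves the connection lingering, which would point at Java/AD rather than Wildfire.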

-Matt

Hi,

looking at "at org.jivesoftware.wildfire.ldap.LdapManager.checkAuthentication(LdapManager.java:346)" with Matt, it seems that you did specify a baseDN and an alternateBaseDN, and line 346 is using the alternateBaseDN. The code looks fine, but it could be possible that the resource which uses the baseDN is not freed properly.

If you have set up an IDE and compiled Wildfire before, you may be able to edit LdapManager (lines 311+ and 362+) and change the try block to

try {
    if (ctx != null) { ctx.close(); }
} catch (Exception e) {
    Log.error(e);
    ctx = null;
}

Anyhow, we are not yet sure that this really helps.

LG

Hi All.

I can confirm that it is indeed the authentication conversation being held open.

I've watched it via Wireshark many times, and can correlate the authentication BIND request to my login activity (including seeing what passing a bad password looks like).

I can spin up a small implementation on a separate box and try to hit an iPlanet LDAP server vs. AD to compare the activity, if that would be useful.

Pat

Hi Pat

I don't think that this will help. Do you have the option to use just one baseDN, or to switch baseDN and alternateBaseDN?

It seems that you have a good network sniffer, so could you please check which of the authentication connections stays open? (Packet 47: which of the DNs is it using?)

I assume it is the one which uses baseDN and fails, while the one which uses alternateBaseDN successfully authenticates the user and gets closed.

LG

PS: @Matt: The documentation reads:

ldap.alternateBaseDN – a second DN in the directory can optionally be set. If set, the alternate base DN will be used for authentication and loading single users, but will not be used to display a list of users (due to technical limitations).

This is not really the case: even if it is set, the baseDN is tried first, and only if that fails is the alternateBaseDN used. So one may change the code to match the documented behaviour.
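The observed ordering can be sketched like this (the Binder interface is just a stand-in for the real bind call, not Wildfire's API):

```java
import javax.naming.NamingException;

public class BaseDnFallback {
    interface Binder {
        void bind(String baseDn) throws NamingException;
    }

    // Try baseDN first; fall back to alternateBaseDN only if the first bind fails.
    // This is the behaviour observed in the code, not the one the docs describe.
    static String resolve(Binder binder, String baseDn, String alternateBaseDn)
            throws NamingException {
        try {
            binder.bind(baseDn);
            return baseDn;
        } catch (NamingException first) {
            if (alternateBaseDn == null) {
                throw first; // no fallback configured
            }
            binder.bind(alternateBaseDn);
            return alternateBaseDn;
        }
    }
}
```

If the context created for the failed baseDN attempt is not closed on that exception path, it would linger exactly as described.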

Hi LG/Matt.

I've just finished some comparative testing.

I installed a fresh Wildfire instance under Linux (Fedora) with no firewall in the mix at all.

I set the same AD parameters for authentication against the same AD server, and we tested with just three users.

The results with the AD controller were the same: the individual authentication BIND requests would stay open for 15 minutes until the AD controller timed them out.

I then changed the authentication to an iPlanet LDAP server.

This time, the results were less comforting: the individual authentication BIND requests stayed open for over an hour, or until I finally shut the Wildfire server down.

One interesting thing we found: if a user passed the wrong password and the BIND request failed, the Wildfire server immediately shut down the connection.

So the target LDAP server's settings seem to play a big role in how long the connections linger. However, in all cases the lingering connections are not OS or LDAP server specific; they follow the Wildfire server.

Now I'm not sure I understand LG's baseDN/alternateBaseDN items.

I have a baseDN set, as well as a search filter. I also have an adminDN, but that is just the full CN of the user I use to do the initial BIND for the server. I have no alternateBaseDN.

The Wildfire server does a BIND with the adminDN. It then keeps this connection open and does subsequent SEARCH operations for users trying to connect. The searches use the baseDN, and this same connection is used over and over for searches.

For each user logging in, the results of the search from the adminDN connection are used on a separate BIND connection, passing the search-returned CN and the user's password as credentials. This is the one that lingers.

I can't be unique in this. Is anyone else out there seeing the same thing?

A simple netstat will show all your active LDAP connections.

As for network sniffers:

I use the excellent free Wireshark (www.wireshark.org) for Windows (and to interpret dumps from Linux), and plain ol' tcpdump for Linux.

Pat


Pat,

Thanks for the detailed update; it's appreciated. I'm going to do some more testing this weekend to try to get to the bottom of the issue. I suspect a Java bug, but hope to know more soon.

Regards,

Matt

Hi Pat,

the line 346 issue looks weird. The source of the current 3.1 version is (just as a further reference for me)

http://www.jivesoftware.org/fisheye/viewrep/svn-org/wildfire/branches/3_0_branch/src/java/org/jivesoftware/wildfire/ldap/LdapManager.java?r=4018

but you are using 3.0.1, so I must look at the old code (again just as a further reference for me):

http://www.jivesoftware.org/fisheye/viewrep/svn-org/wildfire/branches/3_0_branch/src/java/org/jivesoftware/wildfire/ldap/LdapManager.java?r=4399

/*343*/     if (debug) {
/*344*/          Log.debug("Created context values, attempting to create context...");
/*345*/     }
/*346*/     ctx = new InitialDirContext(env);
/*347*/     if (debug) {
/*348*/          Log.debug("... context created successfully, returning.");
/*349*/     }
/*404*/     finally {
/*405*/          try { ctx.close(); }

So the source code is clean, and now I also wonder if this is a Java bug.

LG

Hi Matt,

looking again at the stack trace "at com.sun.jndi.ldap.pool.Pool.getPooledConnection(Unknown Source)", I wonder whether it is already a pooled connection. Using one for authentication may be really silly, especially if it is kept open for com.sun.jndi.ldap.connect.pool.timeout='no timeout' seconds.

The default for the connection pool is defined in the code with

if (connectionPoolEnabled) {
    env.put("com.sun.jndi.ldap.connect.pool", "true");
}

so it's already set to true.

  1. So I wonder if it would help to set "" to disable it completely; the auth and all other connections should then be closed immediately.

  2. Or to add "-Dcom.sun.jndi.ldap.connect.pool.timeout=15000" to the vmoptions file, so the auth and all other connections would be closed after 15 seconds.
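Sketched in code, the two options look like this (the property names come from Sun's JNDI/LDAP documentation; the 15-second value is just an example):

```java
import java.util.Hashtable;

public class PoolSettings {
    // Option 1: disable JNDI connection pooling for this context entirely,
    // so every bind gets its own connection and close() really closes it.
    static void disablePooling(Hashtable<String, String> env) {
        env.put("com.sun.jndi.ldap.connect.pool", "false");
    }

    // Option 2: keep pooling but let idle pooled connections die after 15 seconds.
    // Equivalent to -Dcom.sun.jndi.ldap.connect.pool.timeout=15000 in the vmoptions file.
    static void limitPoolIdle() {
        System.setProperty("com.sun.jndi.ldap.connect.pool.timeout", "15000");
    }
}
```

Note that the timeout property is a JVM-wide system property, while the pool on/off switch is set per context environment.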

@Pat: 1) and 2) should work for you; now it looks to me more like bad coding (poor Matty) than a Java bug.

LG

PS: http://java.sun.com/j2se/1.5.0/docs/guide/jndi/jndi-ldap.html

Update from my end: I can duplicate the TIME_WAIT behavior. When the code does a bunch of LDAP operations, I see lots of sockets in TIME_WAIT (using TCPView under Win XP). However, I don't see actual open connections from the Java perspective (doing thread dumps). That means that only the operating system is dealing with those connections. Unfortunately, I don't have a way to check what AD is doing.

I have made a number of tweaks to the LDAP code that might help. I would suggest trying out tomorrow's nightly build or the next beta of Wildfire 3.1, which will be out very soon.

Regards,

Matt

Hi Matt/LG.

I'll test the LDAP items in my test environment and see what happens.

So Matt, since you feel it is OS specific and not Java related that the connections are lingering open, does that indicate that if this problem (lingering LDAP connections) is solved, it won't have any positive impact on the memory situation?

I guess I need to know: can I expect to get 1500-2000 active connections on Wildfire under Windows (or under Linux), with LDAP authentication and a SQL backend, without having to implement the connection manager on separate hardware?

Thanks

Pat

Hi LG.

I added it to the LDAP section of my test environment (Linux), and the behavior was as you expected, both against AD and iPlanet LDAP.

For each client connection, a BIND is made using the generic or named account in the XML file to search the target LDAP for the user CN. This connection is then immediately terminated by Wildfire.

Wildfire then does a BIND connection to LDAP using the user credentials (CN from the search, password from the user). Once the LDAP server responds, Wildfire immediately terminates the LDAP connection.

I will stage this change for our production environment. It seems to me it can't be worse!

The big question will be if this has a direct impact on the memory usage or not, since that is the underlying issue I need to resolve!

So thanks, and I'll post the results back tomorrow after the production restart.

Pat

Hi Pat,

the new Wildfire version does allow you to enable the connection pool again; Matt changed the connection pool logic.

With the new release, or with connection.pool=false, you should never see these native OutOfMemory errors.

I always calculate 1 MB for every connected user, which is usually too much, so you have a reserve; 512 kB (Xmx1024m / 2000 users) is very tight and may or may not be enough.
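The arithmetic behind that rule of thumb, as a quick sketch:

```java
public class HeapBudget {
    // Per-user heap headroom in kB for a given -Xmx (in MB) and concurrent user count.
    static long perUserKb(long xmxMb, long users) {
        return xmxMb * 1024 / users;
    }
}
```

With Xmx1024m, 1024 users get the comfortable 1 MB each, while 2000 users leave only about 512 kB apiece.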

LG

Hi LG.

Thanks for the info. I'm anxious to see how disabling LDAP pooling impacts our environment.

Regarding the memory issue: I recall you giving that 1 MB/user figure before. Where I am stuck is getting Java to use the memory the machine has available; the system has 2 GB of RAM. When we take Xmx up to 1024 (or higher), the number of users connected before memory errors goes down. Perhaps the LDAP pooling is interacting with that, and we can safely increase Xmx.

But the most optimum settings we have found are:

-Xms512m

-Xmx512m

-Xss128k

-Xoss128k

And that won't get us past 600-700 connections.

Pat

Arrr!

'Tis a fine Talk Like a Pirate Day!

Disabling the LDAP pooling did the trick!

We have 580+ connections with not a single memory error.

The LDAP connections to the AD server close right away with no lingering.

The current thread dump memory report is:

Used: 402630224 (~384MB)

Free: 92559792 (~88MB)

Total: 495190016 (~472MB)

Max: 517013504 (~493MB)

So we will run out at some point, but it's not failing/erroring with what is there.

My next plan is to build up an identical test server with a test AD, and use the jabsimul tool to run it up to 2000 users, testing different java memory settings.

I’'ll post results here when I have something.

Thanks for all the help!

Pat

Pat,

Yay! I like happy endings to issues like this.

-Matt

Matt.

This was a success in finding the impact of LDAP/AD on our environment.

But just so there is no misconception, we are not 'done and good to go'.

We still have to figure out how to get the environment to use all the memory the machine has available. We are maxed out at 600 users right now, and can't afford to 'play' in our production environment. So, as I said, this requires hardware setup and time to get some load testing in place.

I'm still not convinced we've settled on the right server implementation. I just want to exhaust all options before forcing my users through yet another change.

P