Load driven extremely high.. tracking how to duplicate

Hi folks!

I discovered an issue yesterday where Wildfire drove my load up quite a bit (from “around 0” to “around 10”). My preliminary investigation has turned up the following information:

  1. I was running the Yahoo transport (xmpppy project) as an external component on my Jabber server

  2. I do, of course, have whatever@yahoo.jabber.vorpalcloud.org items in my roster

  3. The Yahoo transport crashed in the middle of some operation with Wildfire

  4. Looking at the server sessions tab, I saw connections to yahoo.jabber.vorpalcloud.org and chat.yahoo.jabber.vorpalcloud.org. I was investigating this at around 2pm; those sessions had been “open” since around 8am.

  5. Upon closing both sessions, the load immediately dropped to 0 like it was supposed to.

I wanted to go ahead and post a start to this. If I have some time, I'm going to try to duplicate precisely what caused this (and see if I can find a way to reproduce it trivially).

This still occurs periodically. I have yet to track down exactly what causes it.

Hi Daniel,

If it is the Wildfire Java process, a couple of (at least two) thread dumps taken with “kill -3 wildfire-pid” should help a lot to check which threads are running/looping.
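If you prefer, the same kind of dump can also be printed from inside the JVM. Here is a small sketch using the standard Thread.getAllStackTraces() API (just an illustration, not something Wildfire ships):

```java
import java.util.Map;

// Illustration only: an in-JVM alternative to "kill -3" using standard Java APIs.
public class ThreadDumper {
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Print the thread name and state, then each stack frame.
            System.out.println("\"" + thread.getName() + "\" state=" + thread.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
```

The output format differs a bit from the HotSpot dump, but it shows the same thread names, states and stacks.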

LG

Aha, thanks, I keep forgetting about that. Every time Gato asks for more information I can't make it show its head again. =) And when it shows, I forget what to do to get it going. lol

13:07:47 up 12 days, 23:12, 2 users, load average: 3.47, 5.44, 8.43

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
25359 jive      19   0  392m 105m  20m S 99.4  6.9 122:30.77 java

Ok, I got a thread dump, but it's like 81K =) I can post it here, but it'll be hella long.

So instead, http://www.vorpalcloud.org/~daniel/threaddumpduringload.log

Here's the output from the sessions screen. Note that it's treating aim, msn, and yahoo as outgoing sessions and is “stuck” in a pissed-off mode. The moment I sever (close) those connections, the load drops.

 #  Host                          Connection  Created   Last Activity
 1  aim.jabber.vorpalcloud.org    Outgoing    11:02 AM  1:21 PM
 2  gmail.com                     Incoming    12:33 PM  1:20 PM
 3  jab.igloonet.com              Incoming    1:13 PM   1:18 PM
 4  jabber.rootbash.com           Incoming    1:07 PM   1:15 PM
 5  jabber.vorpalcloud.org        Incoming    11:02 AM  1:21 PM
 6  msn.jabber.vorpalcloud.org    Outgoing    11:02 AM  1:21 PM
 7  mzet.net                      Both        1:15 PM   1:16 PM
 8  (no host shown)               Incoming    1:21 PM   1:21 PM
 9  yahoo.jabber.vorpalcloud.org  Outgoing    11:02 AM  1:21 PM

Can I reproduce it every time? hahaha no =(

Typically this happens when I start Wildfire, log in, and then start the transports. I think something is getting confused in the window in between. Like: I log in, Wildfire sees that it is not serving aim.jabber.vorpalcloud.org, so it tries to make an s2s connection to it; before I sever that connection, PyAIMt connects and claims aim.jabber.vorpalcloud.org, and load and confusion ensue.

Hi Daniel,

Did you take more than one thread dump while Wildfire was using a lot of CPU? “sun.misc.Unsafe.unpark” looks interesting, but this could be just normal. As you can close the connections which cause the trouble, I wonder if you also took one after doing that.

LG

I took like 4 or 5 of them just 'cause. =D Maybe it would catch something interesting.

Hi Daniel,

I can take a look at your dumps if you want to zip them and offer a download location, either here or via PM.

LG

The link above was the download location. =)

I've seen the link to one thread dump, but it seems that I missed the one to all dumps.

http://www.vorpalcloud.org/~daniel/threaddumpduringload-2.log I thought I had overwritten the other one with the one containing the multiple dumps. Dooh.

Hi Daniel,

Is there one dump in this file which was taken after you closed the sessions?

All javacores look very similar, except the first one, which starts with “An unexpected error has been detected by HotSpot Virtual Machine”.

LG

Ya know, no, I didn't take one after I got things running. Didn't think about that.

I did find out that I can damn near duplicate this behavior on command now. I think that when I log in to my account before the python transports have connected, the aim.jabber.* and friends JIDs are treated as external (s2s) refs. As such, Wildfire gets bogged down trying to connect to them externally, or even tracks/caches that they are an external ref. Then the python transports connect, and all of a sudden there's this weird internal JID that's referenced in the external cache, and it's confused. I believe, if this is an accurate assessment, that simply setting it up so that when an “external component” connects it wipes the s2s cache entry for that domain would fix it.

This is all theory.
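To make the idea concrete, though, here's a rough sketch of what I mean by wiping the s2s entry when the component connects. None of these class names are real Wildfire API; they're just stand-ins for whatever tracks the outgoing server sessions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an open outgoing server-to-server session.
interface OutgoingServerSession {
    void close();
}

// Hypothetical registry of outgoing s2s sessions keyed by remote domain.
class SessionRegistry {
    private final Map<String, OutgoingServerSession> outgoing =
            new ConcurrentHashMap<String, OutgoingServerSession>();

    void addOutgoingSession(String domain, OutgoingServerSession session) {
        outgoing.put(domain, session);
    }

    // Called when an external component (e.g. aim.jabber.vorpalcloud.org)
    // registers its subdomain: close and forget any s2s session that was
    // opened for that same name while the component was down.
    void onExternalComponentRegistered(String subdomain) {
        OutgoingServerSession stale = outgoing.remove(subdomain);
        if (stale != null) {
            stale.close();
        }
    }
}
```

If Wildfire did something along those lines, the moment PyAIMt claimed aim.jabber.vorpalcloud.org the stale s2s attempt would be closed out instead of sitting there pegging the CPU.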