Users connect/disconnect, see high cpu - where to start looking

Hi there - I am hoping someone can point me in the right direction on where to look.

I just upgraded to version 4.2.3 (4.1.x had same issue I’m trying to troubleshoot).

For debugging, I’m running openfire.exe with the following params (taken from another thread here):
-Djava.net.preferIPv4Stack=true
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-Xmx2048m
-XX:MaxPermSize=192m
-XX:NewSize=200m
-XX:SurvivorRatio=4
-Xss128k
-XX:ThreadStackSize=128
-Xloggc:c:\gc\gc.log
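One thing worth checking with a flag list like this is which options the JVM actually picked up. The sketch below is only an illustration: the class name `JvmArgsCheck` and its helper are made up for this example, and it reports on the JVM it runs in, so it would have to be started with the same options (for the running Openfire process, VisualVM's overview tab lists the JVM arguments). It also flags the PermGen options, which Java 8 accepts but ignores because PermGen was removed in that release.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class JvmArgsCheck {

    /** Arguments the current JVM was actually started with. */
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Options that Java 8 accepts but ignores, because PermGen no longer exists. */
    static List<String> ignoredOnJava8(List<String> args) {
        return args.stream()
                .filter(a -> a.startsWith("-XX:MaxPermSize") || a.startsWith("-XX:PermSize"))
                .collect(Collectors.toList());
    }

    public static void main(String[] unused) {
        List<String> args = inputArguments();
        args.forEach(System.out::println);
        System.out.println("Ignored on Java 8: " + ignoredOnJava8(args));
        // Effective heap ceiling, to compare against the -Xmx value that was requested.
        System.out.println("Max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```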

Windows 2012 R2 on AWS, 64-bit, 16 GB, Xeon E5-2686
DB is SQL Server 2014, 64-bit
Was using the bundled Java, but then replaced it with the latest JDK so I could use VisualVM
Using Candy.js

Load testing with 10 users, sending 1 message per second (not very high). I’d think there should be no issues with such a low load, but I see the CPU ramp up to 100%, and that might be when clients start going on/offline.

Using custom authentication (another SQL database - I don’t believe this is the issue).
Broadcast and search plugins are installed.

Any suggestions on what I could look at to troubleshoot this? I was hoping it was a simple Java issue, but I’m not so sure anymore - maybe it’s an issue with candy.js…

Which exact JVMs (64 or 32 bit) are you using? Also, please test against 4.3.0 beta too.

What about doing it with some other client, to rule out Candy issue?

Thank you both.

  1. 64-bit - it was first the bundled JRE, now it’s the 8u191 JRE bundled with the JDK
     Concurrently, I started to see what it takes to get Spark going - I need to see how the username, password, etc. are passed in by Candy

  2. Could go with 4.3.0, but I saw a bug report about Java errors from someone on Windows, so I’m not sure it’s stable enough

  3. Another twist - there is a load balancer in the mix (don’t know the type), but this is the only machine in the LB pool. Also, I don’t know how much work it is to skip the LB. HTTP binding is disabled, certs exist - don’t know how dependent it is on a valid cert

  4. Prod has a similar setup and the same issues, but might have a cluster (2 machines)

Although I wish it weren’t… it very well might be Candy… it’s silly that it struggles

4.3.0 beta is usable for testing. Only when updating from an older version will you have to remove the plugins\admin\webapp\WEB-INF\lib folder manually for the Admin Console to work.

I thought I’d go back, flush the logs, start it, and then just capture the logs for the load:

error log
2018.12.21 15:52:25 org.jivesoftware.openfire.http.HttpBindServlet - Error sending packet to client.
org.jivesoftware.openfire.http.HttpConnectionClosedException: The http connection is no longer available to deliver content

warn log
- this one is probably not a concern.
2018.12.21 15:44:53 org.jivesoftware.openfire.spi.LegacyConnectionAcceptor - Configuration allows for up to 16 threads, although implementation is limited to exactly one.

see many
2018.12.21 15:45:25 org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x0000001C: nio socket, server, /xxxxxxxx:50145 => /xxxxxx:5222)
javax.net.ssl.SSLHandshakeException: SSL handshake failed.

Caused by: javax.net.ssl.SSLHandshakeException: null cert chain

also see these
2018.12.21 15:46:12 org.jivesoftware.openfire.http.HttpSession - Unable to deliver a stanza (it is being queued instead), although there are available connections! RID / Connection processing is out of sync!

and these
2018.12.21 15:46:19 org.jivesoftware.openfire.PresenceRouter - Rejected available presence:
- org.jivesoftwarfqdn/ext id: 4zpdrj46zl presence:

I’ve x’d out IPs and usernames

It seems like connections are being closed due to “Caused by: javax.net.ssl.SSLHandshakeException: null cert chain” - am I reading this right?
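For context on that message: in Java's TLS stack, "null cert chain" is what the server side reports when it requires a client certificate and the peer sends none. Whether that is what is happening on port 5222 here is an assumption; it would fit a client (or a load balancer health check) connecting without a certificate while mutual authentication is set to required. The sketch below only illustrates the "wanted" versus "needed" distinction with the standard `SSLEngine` API; the class and enum names are made up for the example and it is not Openfire's code.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class ClientAuthModes {

    /** Hypothetical names for the three mutual-authentication policies. */
    enum Mode { DISABLED, WANTED, NEEDED }

    /** Builds a server-side engine with the given client-certificate policy. */
    static SSLEngine serverEngine(Mode mode) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine();
            engine.setUseClientMode(false);
            if (mode == Mode.NEEDED) {
                // Handshake fails if the client presents no certificate.
                engine.setNeedClientAuth(true);
            } else if (mode == Mode.WANTED) {
                // Certificate is requested, but the handshake continues without one.
                engine.setWantClientAuth(true);
            }
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] unused) {
        for (Mode mode : Mode.values()) {
            SSLEngine engine = serverEngine(mode);
            System.out.println(mode + ": need=" + engine.getNeedClientAuth()
                    + " want=" + engine.getWantClientAuth());
        }
    }
}
```

If the handshake failures line up with connections that are never expected to present a certificate, relaxing the policy from "needed" to "wanted" (or disabling it) would be the thing to test.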

Now on 4.3.0 (not nightly build).
I did not have an issue with the lib folder in plugins, but did hit the incorrect /lib reference for log4j, which was looking in bin/lib (copied the libs over to bin for now).

Yeah, that’s another issue, but this is just for logging. Well, maybe it is important in this case.

Found out the following bits of info:

The chat web page makes an Ajax call to an OAuth-proxy web service, which in turn calls another server that connects to Openfire to bind to XMPP and get a chat session, which is returned to Candy JS.

After Candy is initialized (has obtained the session info), it performs /http-bind/ calls to Openfire as necessary to send the user’s talk and to receive responses. Those /http-bind/ calls are routed to Openfire through a proxy, which has rules for how to route them.

In the error log, I now have:
2018.12.21 17:24:44 org.jivesoftware.openfire.handler.IQHandler - Internal server error
java.lang.NullPointerException: null
at org.jivesoftware.openfire.handler.IQMessageCarbonsHandler.handleIQ(IQMessageCarbonsHandler.java:52) ~[xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:62) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:369) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:112) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:104) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:63) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:639) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1284) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]

and saw a bunch of
Unable to deliver a stanza (it is being queued instead…

An update on progress.

I’ve been able to directly authenticate a user skipping candy.js; for my test I had to use converse.js + BOSH.

FWIW - this is the same way it’s implemented with the server-side logic, using BOSH/http-binding.

We’re now looking at getting Tsung configured so that we can repeat our testing at scale to see if we have the same issues.

Are there any known issues/limitations or performance metrics for users over http-bind? (besides the one referenced below)

Also found this interesting link and the quoted text below (too bad those scripts don’t exist anymore):
http://gsoc.safasofuoglu.org/2008/06/19/load-testing-bosh-on-openfire/

There seems to be a critical point somewhere around 350th second. This is the point where the client ran out of CPU. After this point, client threads started to lose their connections because they couldn’t get CPU along the inactivity period, which was 30 seconds. When a client doesn’t make a request to the server during the inactivity period, its session will be killed by the server. Jetty responds with a 404 - Not Found message to clients who have lost their session
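The inactivity rule described in that quote can be sketched as follows. This is a simplified illustration, not Openfire's or Jetty's implementation: the class name is made up, the 30-second value comes from the quote, and real BOSH sessions only count inactivity while no request is being held, which the sketch ignores.

```java
public class BoshInactivity {

    private final long inactivityMillis;
    private long lastRequestMillis;

    /** Starts tracking a session created at nowMillis with the given inactivity period. */
    BoshInactivity(long inactivitySeconds, long nowMillis) {
        this.inactivityMillis = inactivitySeconds * 1000L;
        this.lastRequestMillis = nowMillis;
    }

    /** Called whenever the client makes a request; resets the inactivity window. */
    void onRequest(long nowMillis) {
        lastRequestMillis = nowMillis;
    }

    /** True once the client has been silent for longer than the inactivity period. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastRequestMillis > inactivityMillis;
    }

    public static void main(String[] unused) {
        BoshInactivity session = new BoshInactivity(30, 0);
        System.out.println("expired at 29s: " + session.isExpired(29_000));
        System.out.println("expired at 31s: " + session.isExpired(31_000));
    }
}
```

A load-test client that is starved of CPU stops calling the equivalent of `onRequest`, so its sessions expire even though the server itself may be healthy.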

So - after tracing the code… it looks like the setting that needs to be added is

xmpp.httpbind.client.request

The default is 2

Clients set this higher than 2

But since this session is being created by backend services, this is never set. I arbitrarily used 100 and no longer have the connecting/reconnecting issue when going beyond a few users.

Hope it helps someone else.
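To illustrate why a low request limit could produce this kind of connect/disconnect behavior: as I read XEP-0124, the server advertises a `requests` value, the maximum number of simultaneous requests a client may have open, and treats anything beyond that as overactivity. The sketch below is a hypothetical per-session counter, not Openfire's actual code; the class and method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RequestLimit {

    private final int maxRequests;
    private final AtomicInteger open = new AtomicInteger();

    /** maxRequests plays the role of the configured limit (2 by default per the post). */
    RequestLimit(int maxRequests) {
        this.maxRequests = maxRequests;
    }

    /** Returns true if the request is accepted, false if it would exceed the limit. */
    boolean tryOpen() {
        while (true) {
            int current = open.get();
            if (current >= maxRequests) {
                return false; // a real server would likely terminate the session here
            }
            if (open.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Called when a previously accepted request has been answered. */
    void close() {
        open.decrementAndGet();
    }

    public static void main(String[] unused) {
        RequestLimit session = new RequestLimit(2);
        System.out.println(session.tryOpen()); // first request accepted
        System.out.println(session.tryOpen()); // second request accepted
        System.out.println(session.tryOpen()); // third request rejected
    }
}
```

With a limit of 2, a client that pipelines a third request while two are still held is rejected; with 100 the same traffic passes. In Openfire the value would be set as a system property, for example through the Admin Console's System Properties page.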
