Hi there - I'm hoping someone can point me in the right direction on where to look…
I just upgraded to version 4.2.3 (4.1.x had the same issue I'm trying to troubleshoot).
For debugging, I'm running openfire.exe with the following parameters (taken from another thread here):
-Djava.net.preferIPv4Stack=true
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-Xmx2048m
-XX:MaxPermSize=192m
-XX:NewSize=200m
-XX:SurvivorRatio=4
-Xss128k
-XX:ThreadStackSize=128
-Xloggc:c:\gc\gc.log
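A side note on those flags for anyone copying them: on Java 8, -XX:MaxPermSize is ignored (PermGen was removed in Java 8), -XX:+CMSIncrementalMode is deprecated, and -Xss128k / -XX:ThreadStackSize=128 are two spellings of the same setting. A trimmed equivalent, keeping the sizes from above rather than recommending new ones, would be:

```
-Djava.net.preferIPv4Stack=true
-XX:+UseConcMarkSweepGC
-Xmx2048m
-XX:NewSize=200m
-XX:SurvivorRatio=4
-Xss128k
-Xloggc:c:\gc\gc.log
```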
Windows 2012 R2 on AWS, 64-bit, 16 GB RAM, Xeon E5-2686
DB is SQL Server 2014, 64-bit
Was using the bundled Java, but then replaced it with the latest JDK so I could use VisualVM
Using Candy.js
Load testing with 10 users, each sending 1 message per second (not very high). I'd think there should be no issues at such a low load, but I see CPU ramp up to 100%, and that may be when clients start going on/offline.
Using custom authentication (against another SQL database; I don't believe this is the issue)
Broadcast and search plugins are installed.
Any suggestions on what I could look at to troubleshoot this? I was hoping it was a simple Java issue, but… not so sure anymore - maybe it's an issue with Candy.js…
64-bit. It was first the bundled JRE; now it's the 1.8.0_191 JRE bundled with the JDK
2 started concurrently, to see what it takes to get Spark going. Need to see how the username, password, etc. are passed in by Candy.
Could go with 4.3.0, but I saw a bug report of Java errors for someone on Windows; not sure if it's stable enough.
Another twist: there is a load balancer in the mix (don't know the type), but this is the only machine in the LB pool. Also don't know how much work it is to skip the LB. HTTP binding is disabled, certs exist; don't know how dependent it is on a valid cert.
Prod has a similar setup with the same issues, but might have a cluster (2 machines).
Although I wish it weren't… it very well might be Candy. It's silly that it struggles.
4.3.0 beta is usable for testing. Only when updating from an older version, you will have to remove the plugins\admin\webapp\WEB-INF\lib folder manually for the Admin Console to work.
I thought I'd go back, flush the logs, start it, and then just capture the logs under load:
Error log:
2018.12.21 15:52:25 org.jivesoftware.openfire.http.HttpBindServlet - Error sending packet to client.
org.jivesoftware.openfire.http.HttpConnectionClosedException: The http connection is no longer available to deliver content
Warn log:
– this one is probably not a concern:
2018.12.21 15:44:53 org.jivesoftware.openfire.spi.LegacyConnectionAcceptor - Configuration allows for up to 16 threads, although implementation is limited to exactly one.
I see many of these:
2018.12.21 15:45:25 org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x0000001C: nio socket, server, /xxxxxxxx:50145 => /xxxxxx:5222)
javax.net.ssl.SSLHandshakeException: SSL handshake failed.
…
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
I also see these:
2018.12.21 15:46:12 org.jivesoftware.openfire.http.HttpSession - Unable to deliver a stanza (it is being queued instead), although there are available connections! RID / Connection processing is out of sync!
and these
2018.12.21 15:46:19 org.jivesoftware.openfire.PresenceRouter - Rejected available presence:
Now on 4.3.0 (not the nightly build).
Did not have an issue with the lib folder in plugins, but did hit an incorrect reference for log4j: it looks for /lib under bin/lib (copied the libs over to bin for now).
The chat web page makes an Ajax call to an OAuth-proxy web service, which in turn calls another server that connects to Openfire to bind to XMPP and obtain a chat session, which is returned to Candy.js.
After Candy is initialized (has obtained the session info), it performs /http-bind/ calls to Openfire as necessary to send the user's messages and receive responses. Those /http-bind/ calls are routed to Openfire through a proxy, which has rules for how to route them.
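That handoff could be sketched roughly like this. The response shape and the parseSession helper are illustrative assumptions, not the deployment's actual code; the attach call in the comment assumes Candy's prebind-style attach (Candy wraps Strophe):

```javascript
// Hypothetical prebind handoff: the backend has already bound an XMPP
// session; the page only needs to hand Candy the jid/sid/rid triple.
// parseSession is a stand-in for handling the OAuth-proxy response.
function parseSession(json) {
  const s = JSON.parse(json);
  if (!s.jid || !s.sid || s.rid == null) {
    throw new Error('incomplete prebind session');
  }
  // A BOSH session is resumed by request id (RID), so the client must
  // continue the server's RID sequence rather than restart it.
  return { jid: s.jid, sid: s.sid, rid: Number(s.rid) + 1 };
}

// In the page, the usage would then look roughly like:
//   Candy.init('/http-bind/', { core: { debug: false } });
//   const s = parseSession(responseText);
//   Candy.Core.attach(s.jid, s.sid, s.rid);
```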
In the error log, I now have:
2018.12.21 17:24:44 org.jivesoftware.openfire.handler.IQHandler - Internal server error
java.lang.NullPointerException: null
at org.jivesoftware.openfire.handler.IQMessageCarbonsHandler.handleIQ(IQMessageCarbonsHandler.java:52) ~[xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:62) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:369) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:112) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:104) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:63) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:639) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1284) [xmppserver-4.3.0-beta.jar:4.3.0-beta]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
and I saw a bunch of
Unable to deliver a stanza (it is being queued instead…
There seems to be a critical point somewhere around the 350th second. This is where the client ran out of CPU. After this point, client threads started to lose their connections because they couldn't get CPU during the inactivity period, which was 30 seconds. When a client doesn't make a request to the server within the inactivity period, its session is killed by the server, and Jetty responds with a 404 Not Found to clients that have lost their session.
So, after tracing the code, it looks like the setting that needs to be added is
xmpp.httpbind.client.request
The default is 2
Clients normally set this higher than 2.
But since this session is being created by backend services, it is never set. I arbitrarily used 100 and have no issues beyond the earlier few-users connecting/reconnecting issue.
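To make the failure mode concrete, here is a toy model of that request-slot limit. It is only an illustration of the queueing behaviour behind "Unable to deliver a stanza (it is being queued instead)", not Openfire's actual implementation; the property name comes from the trace above:

```javascript
// Toy model of BOSH request slots (the xmpp.httpbind.client.request
// analogue): a stanza can only be delivered to the client while the
// session has a free in-flight HTTP request slot; otherwise it queues.
function simulate(maxRequests, inFlight, stanzas) {
  let delivered = 0, queued = 0;
  for (let i = 0; i < stanzas; i++) {
    if (inFlight < maxRequests) {
      inFlight++;   // a free slot: deliver immediately
      delivered++;
    } else {
      queued++;     // out of slots: stanza waits in the session queue
    }
  }
  return { delivered, queued };
}

// With the default of 2 slots, a burst of 10 stanzas leaves 8 queued:
console.log(simulate(2, 0, 10));   // { delivered: 2, queued: 8 }
// Raising the limit (100 in this thread) lets the whole burst through:
console.log(simulate(100, 0, 10)); // { delivered: 10, queued: 0 }
```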