Openfire 4.0.1 and problem with increasing size of non-identified sockets

What you’re saying is that currently, the session is indirectly its own backup deliverer? That would correspond with my earlier thought that everything appears to be one session.

I am still worried by these log statements:

Unable to deliver a stanza (it is being queued instead), although there are available connections! RID / Connection processing is out of sync!

I can’t shake the feeling that something is off with the client implementation and/or load balancing setup.

Can you explain what the log message means? Maybe then I could understand what is going wrong in our client.

A client can use more than one connection (HTTP request) at the same time to exchange data with the server. While it has sent one HTTP request (to which no response has yet been returned), another HTTP request can be made to send more data.

To keep track of what answer should go to what request, a ‘rid’ value is used (see: https://xmpp.org/extensions/xep-0124.html#rids )

The warning that is logged indicates that Openfire cannot identify the correct request to use to send back data. An obvious reason would be that the connections are all closed, but this specific warning is logged when there are open connections (but they do not have the expected ‘rid’ value).
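
For illustration, this is roughly what that looks like on the wire (values made up, following the examples in XEP-0124): the client increments the ‘rid’ by one for every request, so the server can pair responses with requests.

<!-- First request; the server may hold it open (long polling). -->
<body rid="1573741820" sid="SomeSID" xmlns="http://jabber.org/protocol/httpbind">
  <message to="juliet@example.com" xmlns="jabber:client">
    <body>Hello</body>
  </message>
</body>

<!-- Second request, sent while the first one is still outstanding. -->
<body rid="1573741821" sid="SomeSID" xmlns="http://jabber.org/protocol/httpbind"/>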

class org.jivesoftware.openfire.websocket.WebSocketConnection // 108

public PacketDeliverer getPacketDeliverer() {
    if (backupDeliverer == null) {
        backupDeliverer = new OfflinePacketDeliverer();
    }
    return backupDeliverer;
}

class org.jivesoftware.openfire.SessionManager // 386

public HttpSession createClientHttpSession(...) throws UnauthorizedException {
    if (serverName == null) {
        throw new UnauthorizedException("Server not initialized");
    }
    // Returns the PacketDeliverer registered with this server. The deliverer
    // was registered with the server as a module while starting up the server.
    PacketDeliverer backupDeliverer = server.getPacketDeliverer();
    HttpSession session = new HttpSession(backupDeliverer, serverName, address, id, rid, connection, language);

Looking at the code, every session is indeed indirectly its own backup deliverer. For websocket connections the proper deliverer (OfflinePacketDeliverer) is used, while SessionManager uses the server's default deliverer as its backup deliverer. It may be a simple change to replace server.getPacketDeliverer() with new OfflinePacketDeliverer().
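
To make that concrete, the change would look something like this (a sketch only, not a tested patch):

// In SessionManager#createClientHttpSession, replace the server-wide default:
PacketDeliverer backupDeliverer = server.getPacketDeliverer();
// ...with the deliverer that WebSocketConnection already uses, so that
// undeliverable stanzas end up in offline storage instead of being re-routed:
PacketDeliverer backupDeliverer = new OfflinePacketDeliverer();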

Well, our problems are not resolved. Maybe Openfire is not optimized for http-bind with this number of online users. The number of files opened by Openfire is increasing hour by hour. It seems like connections are not being closed.
We will try another solution for our Jabber server.

@Guus der Kinderen any comment to the backupDeliverer? It seems this bug has been in Openfire for 10 years.


newstask.JPG

This is a new stack trace, with 26551 threads blocked by one; that is one of the problems. We also have another problem with open files, which we list with lsof: the number of “can’t identify protocol” entries keeps increasing. I think these are connections that were not closed properly. The logs are attached:
lsof-log.txt.zip (113925 Bytes)
2016-04-15_09_38.zip (694012 Bytes)

That’s what makes me wonder whether we’re on the right track, LG. Loads of people have used BOSH in that time. If there was such a bug, we would have noticed by now.

[OF-813] Memory leak - IgniteRealtime JIRA, and there are a few threads here where an httpbind memory leak was suspected but could not be found. Some fixes were applied without knowing whether they addressed the root cause. It seems the problem is not always triggered, and with few dropped connections this bug will be hard to notice.

It seems that here one connection is dropped before it receives its packets, and/or it has an issue receiving packets and can only send data. Fixing the client would be a great idea, but the server should keep running even with such issues.

Nevertheless, assigning the standard deliverer here instead of the backup one seems to be a bug. As there are no comments in the code which explain why the standard deliverer was used, I can only assume that Gato or Derek or whoever implemented this in 2006 used it as a quick solution for initial tests and meant to add a “/* TODO replace with backup deliverer */” note.

at org.jivesoftware.openfire.http.HttpSession.failDelivery(HttpSession.java:1060)
at org.jivesoftware.openfire.http.HttpSession.closeSession(HttpSession.java:1042)
- locked <0x000000069b54c2b8> (a java.util.Collections$SynchronizedRandomAccessList) #  1595 waiting threads
- locked <0x000000069995e1c8> (a java.util.Collections$SynchronizedRandomAccessList) #  692 waiting threads
- locked <0x000000069b54c2b8> (a java.util.Collections$SynchronizedRandomAccessList) # 24250 waiting threads

10417x can’t identify protocol

1862x TCP connections (1816x HTTP-Bind, 31x XMPP)

Does the number of open files (lsof) increase slowly to this high value, or do you see this high value only when Openfire creates so many threads?

We changed the HAProxy configuration that sits in front of Openfire from http mode to tcp mode. Six hours have passed and there is no problem.

Tomorrow I will report total information.

We see the number of open files increase slowly. The limit is 100k; when it reaches this limit we get a “too many open files” error.

We have now changed HAProxy from http to tcp balancing mode, and after 6 hours there is no problem.

Now that you mention HAProxy: there were similar issues with Netty + NIO, see Handling TCP RST · Issue #3193 · netty/netty · GitHub

Maybe Openfire has similar issues and keeps the sending connection open after receiving an RST from HAProxy.

@Guus This would explain why we don’t see this issue more often, as normal clients use the FIN handshake to close the session. So one should fix both issues (proper RST handling and the proper backup deliverer) if it’s possible to recreate them. Anyhow, my knowledge of Mina/NIO is limited, and creating a test case for the potential RST issue may be hard.
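
As a starting point for such a test case: a client can force an RST (instead of a FIN) by enabling SO_LINGER with a zero timeout before closing the socket. A minimal sketch, with host and port as placeholders:

import java.net.Socket;

public class RstCloseTest {
    public static void main(String[] args) throws Exception {
        // Connect to the BOSH port (placeholder host/port).
        try (Socket socket = new Socket("localhost", 7070)) {
            // Start a request so the server considers the connection active.
            socket.getOutputStream().write(
                    "POST /http-bind/ HTTP/1.1\r\nHost: localhost\r\n\r\n".getBytes());
            socket.getOutputStream().flush();
            // SO_LINGER with timeout 0 makes close() send a TCP RST rather
            // than the normal FIN handshake, mimicking an aborting proxy.
            socket.setSoLinger(true, 0);
        } // the socket is closed here, emitting the RST
    }
}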

Can you share the HAProxy configuration you used before, and the exact HAProxy version, so we can create an issue with steps to hopefully reproduce this?

We have nginx and HAProxy in front of Openfire, because chat is an additional module for our portal. Nginx stands in front and handles SSL termination. The nginx config for proxying chat requests:

location /http-bind/ {
        proxy_buffering off;       # disable buffering
        proxy_read_timeout 100s;   # set the read timeout for waiting
        tcp_nodelay on;
        proxy_pass http://chat.haproxy:7007;
}

Behind it stands HAProxy in tcp mode:

listen jabber *:7007
        mode tcp
        option tcplog
        server jabber1 openserver1:7070 check maxconn 12000

Well, our adventure is over. We have updated strophe.js and switched to websockets.

The problems have gone away. We now have about 3500 users online and use just 600 MB of RAM for it. The problems with open files have disappeared.

Thanks for all!

After upgrading from Openfire 3.9.3 to 4.0.2 we’re seeing what seems to be the same issue. These exceptions keep appearing in the log (it is still unclear what causes them):

2016.09.26 14:16:28 WARN  [Jetty-QTP-BOSH-100612]: org.eclipse.jetty.server.HttpChannel - /http-bind/
java.lang.IllegalStateException: state=EOF
        at org.eclipse.jetty.server.HttpInput.setReadListener(HttpInput.java:376)
        at org.jivesoftware.openfire.http.HttpBindServlet.doPost(HttpBindServlet.java:152)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at org.jivesoftware.openfire.http.HttpBindServlet.service(HttpBindServlet.java:116)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)

With every exception the number of abandoned file descriptors increases by one.

Looking at the HttpBindServlet.doPost method there’s this:

final AsyncContext context = request.startAsync();
// Asynchronously reads the POSTed input, then triggers #processContent.
request.getInputStream().setReadListener(new ReadListenerImpl(context));

If an exception is thrown in setReadListener (as is the case in the stack trace), then context.complete() is not invoked. Should there be a try-catch here to make sure complete() is called on the context object if an exception is thrown?
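
Something along these lines, perhaps (a sketch only; the logger name is an assumption, and the actual fix in the pull request below may differ):

final AsyncContext context = request.startAsync();
try {
    // Asynchronously reads the POSTed input, then triggers #processContent.
    request.getInputStream().setReadListener(new ReadListenerImpl(context));
} catch (IllegalStateException e) {
    // setReadListener can fail (e.g. state=EOF, as in the stack trace above);
    // completing the context releases the connection and its file descriptor.
    Log.warn("Error setting read listener on BOSH request", e);
    context.complete();
}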

I’ve created a pull request with a fix for the leaking file descriptors:

Complete context if an exception is thrown when setting read listener by jforsell · Pull Request #653 · igniterealtime/O…

Thanks Johan. I’ve created OF-1201 in our issue tracker for this change.