Openfire Cluster behaves unexpectedly on the MUC

abdurrahmanekr · October 31, 2019, 8:43pm

Hello! We have some issues with a project built on Openfire that we would like to consult you about, especially regarding clustering. First, some background. We have used Openfire for chat among our staff of 400 people for four years, in a fairly amateur way. We are now considering selling the application. The companies we would sell it to have between 1000 and 2000 users each. However, with the plugins and servers we use, a single Openfire instance does not seem to be enough. That is why we need clustering to handle the required load. We also have trouble with MUC. When we use clustering, users receive the following packet when sending messages to groups:

<error code="406" type="modify"><not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error>

Let us describe how we set up clustering. First of all, we dockerized Openfire. We share it as open source at this address. The Dockerfile is very simple: all it does is download the specified version of Openfire and install the package.
The latest dockerized Openfire version is 4.4.2. The image is stored on hub.docker.com under the name abdurrahmanekr/openfire:4.4.2.

To build the cluster, we define multiple Openfire services with docker-compose and start them with docker-compose up -d. All Openfire services are attached to the same volumes. Our example docker-compose.yml is as follows:

version: '2'

services:
  openfire1:
    restart: always
    depends_on:
      - db
    image: abdurrahmanekr/openfire:4.4.2
    ports:
      - "17070:7070"
      - "19090:9090"
    volumes:
      - 'openfire-data:/var/lib/openfire'
      - 'openfire-plugins:/usr/share/openfire/plugins'

  openfire2:
    restart: always
    depends_on:
      - db
    image: abdurrahmanekr/openfire:4.4.2
    ports:
      - "17071:7070"
      - "19091:9090"
    volumes:
      - 'openfire-data:/var/lib/openfire'
      - 'openfire-plugins:/usr/share/openfire/plugins' 

  # DATABASE
  db:
    image: mysql:5.7.26
    ports:
      - "3305:3306"
    volumes:
      - '/var/mysql:/var/lib/mysql'
    command: --default-authentication-plugin=mysql_native_password
    environment:
      MYSQL_ROOT_PASSWORD: '1234'
      MYSQL_DATABASE: 'openfire'

volumes:
  openfire-data:
  openfire-plugins:

As you can see, two Openfire services are running on the same network. We link the Openfire services with the following hazelcast-local-config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<hazelcast xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.hazelcast.com/schema/config
           http://www.hazelcast.com/schema/config/hazelcast-config-3.12.xsd">
    <group>
        <name>openfire</name>
        <password>openfire</password>
    </group>
    <network>
        <port auto-increment="true" port-count="100">5701</port>
        <outbound-ports>
            <ports>0</ports>
        </outbound-ports>
        <!-- The following enables multicast discovery of cluster members
                See http://docs.hazelcast.org/docs/3.12/manual/html-single/index.html#discovering-members-by-multicast
        -->
        <!--
            <join>
                <multicast enabled="true">
                    <multicast-group>224.2.2.3</multicast-group>
                    <multicast-port>54327</multicast-port>
                </multicast>
                <tcp-ip enabled="false"/>
        </join>
        -->
        <!-- The following enables TCP/IP based discovery of cluster members
                See http://docs.hazelcast.org/docs/3.12/manual/html-single/index.html#discovering-members-by-tcp
        -->

+        <join>
+            <multicast enabled="false"/>
+            <tcp-ip enabled="true">
+                <member>openfire1:5701</member>
+                <member>openfire2:5701</member>
+            </tcp-ip>
+        </join>

        <interfaces enabled="false">
            <interface>10.10.1.*</interface>
        </interfaces>
        <ssl enabled="false"/>
        <socket-interceptor enabled="false"/>
        <symmetric-encryption enabled="false">
            <!--
               encryption algorithm such as
               DES/ECB/PKCS5Padding,
               PBEWithMD5AndDES,
               AES/CBC/PKCS5Padding,
               Blowfish,
               DESede
            -->
            <algorithm>PBEWithMD5AndDES</algorithm>
            <!-- salt value to use when generating the secret key -->
            <salt>thesalt</salt>
            <!-- pass phrase to use when generating the secret key -->
            <password>thepass</password>
            <!-- iteration count to use when generating the secret key -->
            <iteration-count>19</iteration-count>
        </symmetric-encryption>
    </network>
</hazelcast>
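
With a TCP/IP join like the one above, one way to rule out discovery problems is to check, from inside one of the containers, that each listed member accepts connections on the Hazelcast port. The following is only a minimal sketch and not part of Openfire or the Hazelcast plugin: the class name MemberCheck and its methods are made up for illustration, and the member strings are simply the ones from the configuration above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical helper for illustration; not part of Openfire or Hazelcast.
class MemberCheck {

    // Splits a "host:port" string as written in a <member> element.
    public static InetSocketAddress parseMember(String member) {
        int colon = member.lastIndexOf(':');
        String host = member.substring(0, colon);
        int port = Integer.parseInt(member.substring(colon + 1));
        // createUnresolved avoids a DNS lookup while parsing.
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Returns true if a TCP connection can be opened within timeoutMillis.
    public static boolean isReachable(InetSocketAddress member, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Resolve the host name here; failures surface as IOException.
            socket.connect(new InetSocketAddress(member.getHostString(), member.getPort()), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String member : List.of("openfire1:5701", "openfire2:5701")) {
            InetSocketAddress address = parseMember(member);
            System.out.println(member + " reachable: " + isReachable(address, 2000));
        }
    }
}
```

If a member is reported as unreachable from the other container, the join problem is at the network level rather than in the Hazelcast configuration itself.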

We use HAProxy to reach Openfire over BOSH from outside and to balance the load. The reason for using HAProxy is its sticky session feature. We need sticky sessions because, if we balanced without them, users would always get a 404 error for http-bind. As far as we know, Openfire cannot share BOSH sessions between nodes through Hazelcast.

The haproxy.cfg file used for balancing is as follows:

global
  maxconn 4096
  pidfile ~/tmp/haproxy-queue.pid

defaults
  log global
  log 127.0.0.1 local0
  log 127.0.0.1 local1 notice
  mode http
  timeout connect 300000
  timeout client 300000
  timeout server 300000
  maxconn 2000
  option redispatch
  retries 3
  option httpclose
  option httplog
  option forwardfor
  option httpchk HEAD / HTTP/1.0

backend openfire_restapi
  balance roundrobin
  mode http
  server openfire_restapi1 localhost:19090 check
  server openfire_restapi2 localhost:19091 check

backend openfire_http_bind
  balance roundrobin
  mode http
  http-check expect status 404

  cookie SERVERID insert indirect nocache

  server ha1 localhost:17070 check cookie ha1
  server ha2 localhost:17071 check cookie ha2

frontend openfire_balance
  bind *:10071 # HAPROXY PORT
  mode http

  acl url_http_bind path_beg /http-bind
  acl url_chatservice path_beg /plugins/

  use_backend openfire_restapi if url_chatservice
  use_backend openfire_http_bind if url_http_bind
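
To make the need for the SERVERID cookie concrete, here is a small simulation of the routing decision. It is a simplified model under stated assumptions, not HAProxy or Openfire code: the class StickyRoutingSketch, its methods and the session id "sid-1" are hypothetical, and the model assumes that a node answers 404 for a BOSH session it did not create, which matches what we observed without stickiness.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of cookie-based stickiness; not HAProxy or Openfire code.
class StickyRoutingSketch {
    private final List<String> servers = List.of("ha1", "ha2");
    private final Map<String, Set<String>> sessionsByServer = new HashMap<>();
    private int nextIndex = 0;

    // Picks a backend: honour the SERVERID cookie if present, else round-robin.
    public String route(String serverIdCookie) {
        if (serverIdCookie != null && servers.contains(serverIdCookie)) {
            return serverIdCookie;
        }
        String chosen = servers.get(nextIndex);
        nextIndex = (nextIndex + 1) % servers.size();
        return chosen;
    }

    // Registers a BOSH session on the server that created it.
    public void createSession(String server, String sid) {
        sessionsByServer.computeIfAbsent(server, k -> new HashSet<>()).add(sid);
    }

    // Assumption of this model: a server only recognises sessions it created.
    public int handle(String server, String sid) {
        Set<String> known = sessionsByServer.getOrDefault(server, Set.of());
        return known.contains(sid) ? 200 : 404;
    }

    public static void main(String[] args) {
        StickyRoutingSketch lb = new StickyRoutingSketch();
        String first = lb.route(null);      // no cookie yet: round-robin picks ha1
        lb.createSession(first, "sid-1");

        // With the cookie, the next request reaches the same server: prints 200.
        System.out.println(lb.handle(lb.route(first), "sid-1"));
        // Without the cookie, round-robin moves on to ha2: prints 404.
        System.out.println(lb.handle(lb.route(null), "sid-1"));
    }
}
```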

As you can see, load balancing is provided on HAProxy port 10071. In front of it we have an nginx server on port 80 that is reachable from outside and forwards to HAProxy. The nginx configuration is as follows:

upstream openfire_upstream {
    # Only one machine
    # Because haproxy is in the back-end
    server localhost:10071;
}

server {
    listen 80;
    server_name exampleopenfire.com;

    location /plugins/ {
        # Openfire Services
        proxy_pass http://openfire_upstream/plugins/;
    }

    location /http-bind/ {
        # Openfire BOSH Connection
        proxy_pass http://openfire_upstream/http-bind/;
    }
}

Our Architecture

[Image: architecture diagram]

With this setup, one-to-one conversations (<message type="chat">) work properly and no problems occur. Group chat conversations also seem to work at first, but when a user's cookie value (SERVERID) changes, we see the error packet below.

Regardless of the cookies, we can create a group and chat with each other. We removed one staff member from the group and added him again (with the RESTAPI plugin). Then that staff member changed his SERVERID from ha1 to ha2. His messages are no longer delivered to the group chat room and he can no longer write to it. When he changes his SERVERID from ha2 back to ha1, his messages are delivered to the room again and he can write to it, so it works. Every staff member in the group chat experiences this. When the group is first created, everyone can chat without problems, but any staff member whose cookie changes runs into the error. In summary, messages in the group are not received and no messages can be written to the room; the server answers with a not-acceptable error:

<error code="406" type="modify"><not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/></error>

Notes

When SERVERID equals ha1, the request is routed to the openfire1 service. When SERVERID equals ha2, the request is routed to the openfire2 service.

Only BOSH is used on Openfire (WebSocket is not used).

No other configuration has been done.

Only the RESTAPI plugin and the Hazelcast plugin are used. The RESTAPI plugin is used to create groups and add members.

Kind regards.

speedy · November 1, 2019, 1:39pm

I'm curious why you have nginx in front of HAProxy. Have you tried removing that layer and letting HAProxy handle all the connections?

Have you tried removing option httpclose from HAProxy? This option has caused me problems in the past, but it may be unrelated to your issue.

abdurrahmanekr · November 2, 2019, 12:03pm

We have already applied exactly what you suggested, but unfortunately we got the same result. So here is a scenario that explains the issue better. When we follow these steps, we run into the problem.
All room management in the scenario (creating a room, adding a person to the group, removing a person from the group) is done with the Openfire REST API plugin.

user1: SERVERID=ha1
user2: SERVERID=ha1

Both users are online.
user1 creates a group named test1.
Both users, user1 and user2, are messaging successfully in the room.
user1 makes user2 an owner of the room.
user1 leaves the room. user2 is alone in the room.
user1 is now offline. (user1 cannot send any packets.)
user2 adds user1 to the room as a member.
user1 comes online and joins the room.
user1 and user2 are messaging successfully in the room.
Up to this point, both users have SERVERID=ha1.
user1 goes offline and sets his SERVERID=ha2, then signs in again and comes online.
user1 is now online with SERVERID=ha2.
But now user1 cannot join the room. (not-acceptable)

That is the whole problem. It starts when a user leaves the room. If a user never leaves the room, there is no problem whether he is online through ha1 or ha2. Once a user has gone offline, the users can no longer exchange messages in the room unless they connect through the server on which they first joined the room.

guus · November 4, 2019, 8:41am

It would be good to try and eliminate as many factors as possible in an attempt to recreate this problem.

Could you try to reproduce this problem with a client that uses a regular TCP connection (port 5222) rather than an HTTP-based connection (BOSH or websocket)? You could use a client like Spark for this, which allows you to explicitly define which server (in the cluster) it should connect to.

If you can reproduce the problem with that, then we know that the problem is likely part of the cluster implementation. If you cannot reproduce the problem with that client, then I’m thinking the problem is caused by something in the elaborate environment that you are using. I am wondering, for example, if the way in which you are switching nodes makes the client session reconnect properly.

Also: enable debug logging, and look in those logs to see if you can find any oddities.

abdurrahmanekr · November 4, 2019, 8:53am

I’ll come back to you after trying what you say. Could there be a problem with the following code?

LocalMUCUser.java
(removeRole)

if (Presence.Type.unavailable == packet.getType()) {
    try {
        // TODO Consider that different nodes can be creating and processing this presence at the same time (when remote node went down)
        role.setPresence(packet);
        removeRole(group);
        role.getChatRoom().leaveRoom(role);
    }
    catch (Exception e) {
        Log.error(e.getMessage(), e);
    }
}

abdurrahmanekr · November 5, 2019, 7:48am

Hi @guus. Inexplicably, it worked over TCP. When using BOSH, I came across an exception that is actually the root cause of the error:

java.lang.NullPointerException: null
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:680) ~[xmppserver-4.4.2.jar:4.4.2]

In fact, in the following code, the joinRole variable is null. (LocalMUCRoom.java)

} else {
    // Grab the existing one.
    joinRole = occupantsByFullJID.get(user.getAddress());
    joinRole.setPresence( presence ); // OF-1581: Use latest presence information.
}

The occupantsByFullJID and occupantsByNickname variables hold inconsistent values. So when the user is in occupantsByNickname, the variable “clientOnlyJoin” becomes “true”, but “joinRole” is null because the user is not in “occupantsByFullJID”.

This problem does not occur on the node where the user originally joined, because only there are these two variables consistent.
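
The inconsistent state can be reproduced in isolation with two plain maps. This is only a simplified model of the situation described above, not the real LocalMUCRoom: the class OccupantIndexSketch, the Role stand-in for MUCRole, the helper listedUnderNickname and the JID used are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the two occupant indexes; not the real LocalMUCRoom.
class OccupantIndexSketch {

    // Stand-in for MUCRole: it only remembers the occupant's full JID.
    static final class Role {
        final String fullJid;
        Role(String fullJid) { this.fullJid = fullJid; }
    }

    final Map<String, List<Role>> occupantsByNickname = new HashMap<>();
    final Map<String, Role> occupantsByFullJID = new HashMap<>();

    // Mirrors the nickname scan: is this exact client listed under the nickname?
    boolean listedUnderNickname(String nickname, String fullJid) {
        for (Role role : occupantsByNickname.getOrDefault(nickname.toLowerCase(), List.of())) {
            if (role.fullJid.equals(fullJid)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        OccupantIndexSketch room = new OccupantIndexSketch();
        String jid = "user1@example.org/web"; // hypothetical JID

        // Inconsistent state: present by nickname, absent by full JID.
        List<Role> roles = new ArrayList<>();
        roles.add(new Role(jid));
        room.occupantsByNickname.put("user1", roles);

        boolean clientOnlyJoin = room.listedUnderNickname("User1", jid);
        Role joinRole = room.occupantsByFullJID.get(jid);
        // Prints: clientOnlyJoin=true, joinRole=null
        System.out.println("clientOnlyJoin=" + clientOnlyJoin + ", joinRole=" + joinRole);
        // Calling a method on joinRole at this point would throw NullPointerException.
    }
}
```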

abdurrahmanekr · November 5, 2019, 10:19am

Here is how we solved the problem. It works correctly when we check both occupantsByNickname and occupantsByFullJID before setting the “clientOnlyJoin” variable.

LocalMUCRoom.java:585

...
            // Check if the nickname is already used in the room
            if (occupantsByNickname.containsKey(nickname.toLowerCase())) {
                List<MUCRole> occupants = occupantsByNickname.get(nickname.toLowerCase());
                MUCRole occupant = occupants.size() > 0 ? occupants.get(0) : null;
                if (occupant != null && !occupant.getUserAddress().toBareJID().equals(bareJID.toBareJID())) {
                    // Nickname is already used, and not by the same JID
                    throw new UserAlreadyExistsException();
                }
                // Is this client already joined with this nickname?
                for (MUCRole mucRole : occupants) {
                    if (mucRole.getUserAddress().equals(user.getAddress())) {
+                        if (occupantsByFullJID.get(user.getAddress()) != null)
                            clientOnlyJoin = true;
                        break;
                    }
                }
            }
...
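
Outside of Openfire, the idea behind the extra check can be sketched as a single decision function: the join is only treated as a "client only" join when both indexes know the full JID. This is a hedged illustration, not the patch itself: the class GuardedJoinSketch, the method isClientOnlyJoin, the plain string maps and the JID are hypothetical stand-ins for the real fields.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-in for the guarded check; not the real LocalMUCRoom.
class GuardedJoinSketch {

    // "Client only" join requires BOTH indexes to know the full JID.
    public static boolean isClientOnlyJoin(Map<String, List<String>> jidsByNickname,
                                           Map<String, String> roleByFullJid,
                                           String nickname, String fullJid) {
        List<String> jids = jidsByNickname.getOrDefault(nickname.toLowerCase(), List.of());
        return jids.contains(fullJid) && roleByFullJid.get(fullJid) != null;
    }

    public static void main(String[] args) {
        String jid = "user1@example.org/web"; // hypothetical JID
        Map<String, List<String>> byNickname = Map.of("user1", List.of(jid));

        // Consistent indexes: the existing role can be reused. Prints true.
        System.out.println(isClientOnlyJoin(byNickname, Map.of(jid, "role"), "user1", jid));
        // Stale nickname entry only: fall through to a normal join. Prints false.
        System.out.println(isClientOnlyJoin(byNickname, Map.of(), "user1", jid));
    }
}
```

In the second case the room would go through the normal join path and create a fresh role instead of looking up one that does not exist on this node.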

Would you like me to open this as a PR? Or should I open an issue about this problem?

Anno_van_Vliet · November 6, 2019, 1:43pm

We experience the same issue and NullPointerException. It occurs after restarting one node of the cluster. Clients reconnect to the server and to the room, but no longer receive messages. Also, they cannot be found in the Console as participants in the room, although they are counted in the number of participants.

speedy · November 6, 2019, 9:55pm

Thank you for your contribution and welcome! Please submit a PR. If you’re interested in helping out more, you’re welcome to check out our jira.
https://issues.igniterealtime.org/

abdurrahmanekr · November 7, 2019, 10:09am

It’s great for us to contribute to this beautiful project. I’ve opened a PR: https://github.com/igniterealtime/Openfire/pull/1521