Clustering not working

Hello,

I am running RHEL5 on two machines connected via a switch. I have installed Openfire 3.4.0 Beta with Clustering, and have successfully installed the enterprise.jar file with the beta license. Clustering is supposed to be available now, but when I try to enable clustering on either machine, I get an error saying:

“Failed to start or join an existing cluster. Check the log for more information.”

Did you check the log?

Might be worth posting any errors from the log here so that someone can help.

Hi,

I’m working with the original poster (Lantern). I’ve pasted the error log and the debug log below, starting from the moment we try to enable clustering on the administrator’s clustering page. In the debug log, it looks like something is trying to connect to www.igniterealtime.org. Our server does not have an internet connection; I’m not sure whether that matters.

////////////// ERROR LOG //////////////////////////

2007.10.18 10:33:57 com.jivesoftware.util.cache.CoherenceClusteredCacheFactory.startCluster(CoherenceClusteredCacheFactory.java:116) Unable to start clustering - continuing in local mode

java.lang.RuntimeException: Failed to start Service “Cluster” (ServiceState=SERVICE_STOPPED, STATE_JOINED)

at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.waitAcceptingClients(Service.CDB:12)

at com.tangosol.coherence.component.net.Cluster$ClusterService.waitAcceptingClients(Cluster.CDB:1)

at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.poll(Service.CDB:8)

at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.poll(Service.CDB:18)

at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService.ensureService(ClusterService.CDB:15)

at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:25)

at com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:16)

at com.tangosol.coherence.component.util.safeService.SafeCacheService.startService(SafeCacheService.CDB:5)

at com.tangosol.coherence.component.util.SafeService.restartService(SafeService.CDB:17)

at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:36)

at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:14)

at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:810)

at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:679)

at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:872)

at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:277)

at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:689)

at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:667)

at com.jivesoftware.util.cache.CoherenceClusteredCacheFactory.startCluster(CoherenceClusteredCacheFactory.java:86)

at org.jivesoftware.util.cache.CacheFactory.startClustering(CacheFactory.java:298)

at org.jivesoftware.openfire.cluster.ClusterManager.startup(ClusterManager.java:258)

at org.jivesoftware.openfire.cluster.ClusterManager.setClusteringEnabled(ClusterManager.java:310)

at org.jivesoftware.openfire.admin.system_002dclustering_jsp._jspService(system_002dclustering_jsp.java:88)

at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1093)

at com.opensymphony.module.sitemesh.filter.PageFilter.parsePage(PageFilter.java:118)

at com.opensymphony.module.sitemesh.filter.PageFilter.doFilter(PageFilter.java:52)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)

at org.jivesoftware.util.LocaleFilter.doFilter(LocaleFilter.java:65)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)

at org.jivesoftware.util.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:41)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)

at org.jivesoftware.admin.PluginFilter.doFilter(PluginFilter.java:69)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)

at org.jivesoftware.admin.AuthCheckFilter.doFilter(AuthCheckFilter.java:98)

at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1084)

at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)

at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)

at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)

at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)

at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)

at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)

at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)

at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)

at org.mortbay.jetty.Server.handle(Server.java:313)

at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:506)

at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:844)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:644)

at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)

at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:381)

at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:396)

at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

//////////////// DEBUG LOG /////////////////////

2007.10.18 10:36:33 Set parameter http.connection.timeout = 3000

2007.10.18 10:36:33 Set parameter http.socket.timeout = 3000

2007.10.18 10:36:33 Open connection to www.igniterealtime.org:80

2007.10.18 10:36:33 Closing the connection.

2007.10.18 10:36:33 Method retry handler returned false. Automatic recovery will not be attempted

2007.10.18 10:36:33 Releasing connection back to connection manager.

2007.10.18 10:36:33 Stat: sever_sessions. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: packet_count. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: server_bytes. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: muc_traffic. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: proxyTransferRate. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: muc_rooms. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: muc_users. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: sessions. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: conversations. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:36:33 Stat: muc_occupants. Last sample: 1192718100. New sample: 1192718160

2007.10.18 10:37:33 Stat: sever_sessions. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: packet_count. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: server_bytes. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: muc_traffic. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: proxyTransferRate. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: muc_rooms. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: muc_users. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: sessions. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: conversations. Last sample: 1192718160. New sample: 1192718220

2007.10.18 10:37:33 Stat: muc_occupants. Last sample: 1192718160. New sample: 1192718220

Hey Trent,

Do you see similar issues when using other machines? Also, can you provide details regarding your setup, including:

  1. OS vendor/version

  2. JVM vendor/version

  3. network setup, i.e. NIC speed & type

Is this the first machine that is starting up the cluster? Or are there other machines in the cluster and you are trying to add a new member to an existing cluster? In any case, make sure that the clocks on the machines are synchronized.

Regards,

– Gato

Hey Gato,

I can get you more detailed info, but in the meantime, the high level summary is that it’s Red Hat Enterprise Linux 5 with Java 5.

I was using Wireshark to analyze the packets. When I try to enable clustering, I notice a bunch of IGMP requests going to 224.0.0.22 with the description “V3 Membership Report.” This goes on for about 50 seconds (until the error message pops up on the admin console). Do the cluster nodes find each other by multicast? My machines are hooked up to each other via a switch, not a router. Could this be causing the problem?

– Trent

Here is the message I get at the console where I start Openfire via the startup script:

2007-10-22 13:56:00.567 Oracle Coherence GE 3.3/387 <Warning> (thread=PacketPublisher, member=n/a): UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 74 packets (109568 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

2007-10-22 13:56:00.570 Oracle Coherence GE 3.3/387 <D5> (thread=PacketPublisher, member=n/a): Attempt to refresh sockets: UnicastUdpSocket{State=STATE_OPEN, address:port=10.52.0.36:8088}, MulticastUdpSocket{State=STATE_OPEN, address:port=224.3.3.0:32386, InterfaceAddress=10.52.0.36, TimeToLive=4}, TcpSocketAccepter{State=STATE_OPEN, ServerSocket=10.52.0.36:8088} caused by MulticastUdpSocket{State=STATE_OPEN, address:port=224.3.3.0:32386, InterfaceAddress=10.52.0.36, TimeToLive=4}; exception java.io.IOException: Network is unreachable

at java.net.PlainDatagramSocketImpl.send(Native Method)

at java.net.DatagramSocket.send(Unknown Source)

at com.tangosol.coherence.component.net.socket.UdpSocket.send(UdpSocket.CDB:18)

at com.tangosol.coherence.component.net.udpPacket.OutgoingUdpPacket.send(OutgoingUdpPacket.CDB:10)

at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketSpeaker$BundlingQueue.flush(PacketSpeaker.CDB:62)

at com.tangosol.coherence.component.util.queue.ConcurrentQueue.flush(ConcurrentQueue.CDB:1)

at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.flushSend(PacketPublisher.CDB:1)

at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onWait(PacketPublisher.CDB:1)

at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:32)

at java.lang.Thread.run(Unknown Source)
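The receive buffer warning at the top of this console output can be checked outside Openfire. Below is a minimal Python sketch, not anything taken from Coherence itself: it asks the OS for a UDP receive buffer of the size named in the log and reports what was granted. The 1468-byte packet size is inferred from the log figures (2096304 bytes / 1428 packets), and the function names are made up for illustration.

```python
import socket

# Figures copied from the Coherence warning in the console output.
REQUESTED_BYTES = 2096304
REQUESTED_PACKETS = 1428
# Inferred packet size (an assumption derived from the two numbers above).
PACKET_BYTES = REQUESTED_BYTES // REQUESTED_PACKETS


def packets_that_fit(buffer_bytes, packet_bytes=PACKET_BYTES):
    """Return how many whole packets fit into a buffer of the given size."""
    return buffer_bytes // packet_bytes


def granted_receive_buffer(requested_bytes=REQUESTED_BYTES):
    """Request a UDP receive buffer and return the size the OS reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    granted = granted_receive_buffer()
    print("requested:", REQUESTED_BYTES, "bytes =", REQUESTED_PACKETS, "packets")
    print("granted:  ", granted, "bytes =", packets_that_fit(granted), "packets")
```

On Linux the ceiling for this request is normally set by the net.core.rmem_max sysctl, and getsockopt reports a doubled value for bookkeeping, so treat the printed numbers as indicative rather than exact.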

The main cause of this that I have seen is an invalid hostname. I had a zone (Solaris) configured to use the name “cluster1”, and I didn’t have “cluster1” defined in the /etc/hosts file. Another thing that caused all sorts of wackiness for me was installing the enterprise plugin from the UI (rather than using the jar). It started messing up a lot of things, and I had to drop my database to get it to work properly (the enterprise.jar upload didn’t update it as it should have).

Hope this helps someone,

– Mick
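The hostname check Mick describes can be scripted. This is a rough Python sketch with hypothetical helper names: it parses hosts-file text, tests whether the machine's own hostname is listed, and then tries to resolve that name. It assumes a Unix-style /etc/hosts and does not look at DNS or other name services.

```python
import socket


def names_in_hosts(hosts_text):
    """Collect every hostname and alias from hosts-file formatted text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        # fields[0] is the IP address; the rest are names and aliases.
        names.update(name.lower() for name in fields[1:])
    return names


def hostname_is_listed(hostname, hosts_text):
    """True if the hostname appears as a name or alias in the hosts text."""
    return hostname.lower() in names_in_hosts(hosts_text)


if __name__ == "__main__":
    local_name = socket.gethostname()
    try:
        with open("/etc/hosts") as handle:
            listed = hostname_is_listed(local_name, handle.read())
        print(local_name, "listed in /etc/hosts:", listed)
        print("resolves to:", socket.gethostbyname(local_name))
    except OSError as error:
        # Covers both a missing hosts file and a name that does not resolve.
        print("check failed:", error)
```

A name that is not listed may still resolve through DNS, so the two results are printed separately.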

Hey Trent,

Yes, Coherence (the clustering solution used by Openfire) uses multicast to find the other cluster nodes. When multicast is not available, you have other options, such as manually specifying where the other cluster nodes are.

Check out the Coherence wiki pages about Foundry switches and Cisco switches. Both of those documents provide some tools for testing, and possible workarounds when multicast is not an option.

Regards,

– Gato
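A quick way to see whether a host can send multicast at all is to try it from a small script. The Python sketch below is an illustration only, not one of the Coherence test tools: it sends a single UDP datagram to the group, port, and TTL shown in the MulticastUdpSocket line of Trent's console output. The function name and the probe payload are made up.

```python
import ipaddress
import socket
import struct

# Values copied from the MulticastUdpSocket line in the console output.
GROUP = "224.3.3.0"
PORT = 32386
TTL = 4


def try_multicast_send(group=GROUP, port=PORT, ttl=TTL, interface=None):
    """Send one UDP datagram to a multicast group.

    Returns (True, None) if the OS accepted the datagram, or (False, error)
    if the send failed, for example with "Network is unreachable".
    """
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(group + " is not a multicast address")
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                            struct.pack("B", ttl))
            if interface is not None:
                # Pin the outgoing interface (the log shows InterfaceAddress=10.52.0.36).
                sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                                socket.inet_aton(interface))
            sock.sendto(b"multicast probe", (group, port))
        return True, None
    except OSError as error:
        return False, error


if __name__ == "__main__":
    ok, error = try_multicast_send()
    print("send accepted" if ok else "send failed: " + str(error))
```

If this fails with the same "Network is unreachable" error, one common cause on Linux is that no route covers the multicast range, which can happen on a host without a default route. A successful send only shows that the local stack accepted the packet; it does not prove that the switch delivers it to the other node.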

Hi Trent,

I created two documents, “Clustering Openfire - Unicast” and “fix UnicastUdpSocket failed to set receive buffer size to 1428 packets”, and I hope that both help you get the cluster running, even if the beta ends in five days.

LG

Hey LG,

As always…good documentation. Thanks for adding those documents.

Regards,

– Gato