Galera: Connection timed out scenarios

While joining an existing Galera cluster, if a node fails to “reach” any of the nodes in the cluster, it generally aborts with the following error:

140505 10:01:46 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '127.0.0.1:4567'
140505 10:01:49 [Warning] WSREP: no nodes coming from prim view, prim not possible
140505 10:01:49 [Note] WSREP: view(view_id(NON_PRIM,cb63893e-d45d-11e3-9dee-0a8c1484c284,1) memb {
        cb63893e-d45d-11e3-9dee-0a8c1484c284,0
} joined {
} left {
} partitioned {
})
140505 10:01:49 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.51674S), skipping check
140505 10:02:19 [Note] WSREP: view((empty))
140505 10:02:19 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
140505 10:02:19 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140505 10:02:19 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://127.0.0.1:4567': -110 (Connection timed out)
140505 10:02:19 [ERROR] WSREP: gcs connect failed: Connection timed out
140505 10:02:19 [ERROR] WSREP: wsrep::connect() failed: 7
140505 10:02:19 [ERROR] Aborting

Assuming that there are no network issues, this error may occur in any of the following scenarios:
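A quick way to rule out basic network problems is to confirm that the joining node can open a TCP connection to the Galera group communication port on an existing node (4567 by default, as in the log above). This is the same check that Scenario #3 performs with telnet. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command; check_port and the script name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port on a peer accepts connections.
# Relies on bash's /dev/tcp redirection and coreutils' timeout.
# check_port is a hypothetical helper, not part of Galera.
check_port() {
  local host=$1 port=$2
  # Open a TCP connection on fd 3 in a child bash; give up after 3 seconds.
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
    return 1
  fi
}

# Run only when a peer address is given, e.g. ./check_galera_port.sh 192.168.56.11
# The port defaults to 4567 (Galera group communication).
if [ $# -ge 1 ]; then
  check_port "$1" "${2:-4567}"
fi
```

A "NOT reachable" result points at the network, a firewall, or a peer that is not running, rather than at the Galera configuration itself.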

Scenario #1: A node tries to join a non-existent cluster

This issue and a possible resolution have been discussed here.

Scenario #2: A node with Galera 3 (25.3.xx) wsrep provider tries to join an existing cluster consisting of Galera 2 (25.2.xx) nodes

This results in an error because, by default, Galera-2 and Galera-3 use different checksum algorithms on network packets: Galera-2 uses plain CRC32 (socket.checksum=1), while Galera-3 uses CRC32-C, which can be hardware-accelerated (socket.checksum=2). The checksum algorithm is controlled by the socket.checksum Galera parameter.

So, the solution to this problem is to start the Galera-3 node with the wsrep_provider_options="socket.checksum=1" option, so that it uses the same checksum algorithm as the other (Galera-2) nodes of the cluster.
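If you are unsure which Galera version the existing nodes run, it can be read from the wsrep_provider_version status variable (SHOW STATUS LIKE 'wsrep_provider_version'; on an existing node), and the matching option derived from it. Below is a minimal sketch, assuming the default checksums described above (25.2.x uses 1, 25.3.x uses 2) and that the existing nodes have not overridden socket.checksum; checksum_for_version and the sample version string are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the socket.checksum value a joining Galera-3 node should use
# from the provider version reported by an existing node.
# Assumes defaults: 25.2.x (Galera 2) -> 1 (CRC32), 25.3.x (Galera 3) -> 2 (CRC32-C).
# checksum_for_version is a hypothetical helper, not a Galera tool.
checksum_for_version() {
  case "$1" in
    25.2.*) echo 1 ;;
    25.3.*) echo 2 ;;
    *) echo "unrecognised provider version: $1" >&2; return 1 ;;
  esac
}

# Hypothetical value, as returned by SHOW STATUS LIKE 'wsrep_provider_version'
# on an existing (Galera-2) node.
existing_version="25.2.9(r165)"

if sum=$(checksum_for_version "$existing_version"); then
  # Line for the [mysqld] section of the joining node's my.cnf
  echo "wsrep_provider_options=\"socket.checksum=$sum\""
fi
```

Since wsrep_provider_options is a single semicolon-separated list, a node that already sets other provider options should have socket.checksum=1 appended to that list rather than a second wsrep_provider_options line.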

References: http://galeracluster.com/documentation-webpages/galeraparameters.html

Scenario #3: telnet: Unable to connect to remote host: No route to host

Solution: Configure the firewall settings:

(a) Flush all the firewall rules: quick, but advisable only for test setups.
$ sudo iptables -F ## Thanks to Qian Joe for the suggestion!

(b) Allow only specific hosts/ports: It's important to note that flushing (-F) deletes all the firewall rules, which is undesirable outside a test setup. Instead, only specific ports and hosts should be allowed (see firewall settings).
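As a sketch of option (b), the script below prints iptables rules that accept TCP traffic from the cluster subnet only on the ports Galera commonly uses: 3306 (MySQL clients), 4567 (group communication, the port in the log above), 4568 (incremental state transfer) and 4444 (state snapshot transfer). The subnet is a placeholder, and the port list assumes your nodes use the defaults. The rules use -I (insert) rather than -A (append) so they land ahead of any catch-all REJECT rule, such as the icmp-host-prohibited rule many RHEL/CentOS installs ship with, which is a common cause of the "No route to host" message.

```shell
#!/bin/sh
# Sketch for option (b): print iptables rules that open only the Galera
# ports to the cluster subnet, instead of flushing every rule with -F.
# CLUSTER_NET and the port list are assumptions; adjust to your setup.
CLUSTER_NET=${CLUSTER_NET:-192.168.56.0/24}

galera_rules() {
  net=$1
  # 3306: MySQL clients, 4567: group communication,
  # 4568: incremental state transfer (IST), 4444: state snapshot transfer (SST)
  for port in 3306 4567 4568 4444; do
    # -I inserts at the top of INPUT, ahead of any catch-all REJECT rule
    echo "iptables -I INPUT -p tcp -s $net --dport $port -j ACCEPT"
  done
}

# Review the output first, then apply it with: ./galera_rules.sh | sudo sh
galera_rules "$CLUSTER_NET"
```

Printing the rules instead of running them directly leaves a chance to review them first. Rules added with iptables are lost on reboot unless saved with your distribution's mechanism.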

EDIT: Do let me know (via comment) if the given resolutions did not work for you and I will be happy to add the failure scenario & resolution here.

4 thoughts on “Galera: Connection timed out scenarios”

  1. Thank you for sharing this. I had the same problem, but it was neither of the two scenarios above. I tried to telnet to ‘node1’ and got the error message below.

    Scenario #3 telnet: Unable to connect to remote host: No route to host

    Solution: Reset the firewall using:
    $ sudo iptables -F

  2. Hi,

    I have the same error, but I was unable to resolve it with the option wsrep_provider_options="socket.checksum=1".

    I ran iptables -L to list the rules, and here is what it says:

    ACCEPT tcp -- 192.168.56.0/24 anywhere tcp dpt:mysql
    ACCEPT tcp -- 192.168.56.0/24 anywhere tcp dpt:tram
    ACCEPT tcp -- 192.168.56.0/24 anywhere tcp dpt:rsync
