Systemd: Managing multiple MariaDB instances

Many Linux distributions now come pre-installed with Systemd. MariaDB added support for Systemd in version 10.1. This support mostly came as a contribution from MariaDB community. In this post, I will discuss how to manage multiple instances of MariaDB server using Systemd. Once MariaDB packages are installed, the Systemd unit files can be located under /lib/systemd/system/.

$ find /lib/systemd/system -name 'mariadb*'
/lib/systemd/system/mariadb.service
/lib/systemd/system/mariadb@bootstrap.service.d
/lib/systemd/system/mariadb@.service

As we are going to deal with multiple instances, we are interested in the last unit file. With the current implementation, there are two ways to handle multiple instances.

  1. Individual configuration files for each instance

    This is the default method for which you do not have to make any changes to the unit file. All you need is create separate configuration files for each MariaDB instance under the standard MariaDB configuration directory /etc/mysql/conf.d/ named with distinct suffixes (aka instance name) but common prefix “my”.

    For instance, in my test setup, I have the following two option files:

    $ find /etc/mysql/conf.d/ -name 'my*'
    /etc/mysql/conf.d/mynode1.cnf
    /etc/mysql/conf.d/mynode2.cnf
    

    That’s it. You can now start MariaDB instances using the following command:

    $ sudo systemctl start mariadb@{suffix}
    
    $ sudo systemctl start mariadb@node1
    $ sudo systemctl start mariadb@node2
    

    So, why should configuration files have common prefix “my” but different suffixes? If you look into the unit file, the command used to start a MariaDB instance uses “%I” in –defaults-file option. This gets replaced by the instance name – the suffix used in the above command. Thus, MariaDB gets started with the corresponding configuration file.

    ExecStart=/usr/sbin/mysqld --defaults-file=/etc/mysql/conf.d/my%I.cnf \
      $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION%I $MYSQLD_OPTS
    
  2. Using separate group suffixes in single configuration file

    This method makes use of MariaDB’s –defaults-group-suffix option. So, you can add options for all the instances in a single configuration file, such that options with values unique to a particular instance are placed in a group with different suffix. The instructions related to this method are commented by default in the multi-instance unit file. So in order to use this method, the multi-instance unit file (/lib/systemd/system/mariadb@.service) needs to be modified.

    • Comment – ConditionPathExists=/etc/mysql/conf.d/my%I.cnf
    • Comment all ExecStartPre and ExecStart commands with –defaults-file
    • Uncomment all ExecStartPre and ExecStart commands with –defaults-group-suffix
    • Reload the systemd daemon:
      $ sudo systemctl daemon-reload
      

    As you might have guessed by now, in this method the instance name “%I” is used in the –defaults-group-suffix (unlike the first method).

    ExecStart=/usr/sbin/mysqld --defaults-group-suffix=%I \
      $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION%I $MYSQLD_OPTS
    

    Thus, each instance is started with options from the group with specified suffix along with the option listed under common group(s) like [mariadb], [mysqld], [server] or [galera].

    For example, the configuration file in my test setup has the following groups :

    $ cat /etc/mysql/my.cnf 
    [mysqld]
    <all common options>
    
    [mysqldnode1]
    <all options specific to node1>
    
    [mysqldnode2]
    <all options specific to node2>
    

    That’s all. Now, same set of commands can be used to start/stop/restart/status MariaDB instances.

    
    $ sudo systemctl start mariadb@node1
    $ sudo systemctl start mariadb@node2
    

    A word of caution: do not use period in group suffix, as that will cause set-environment to fail.

    $ sudo systemctl set-environment MY.SUFFIX=""
    Failed to set environment: Invalid environment assignments
    

Recent bug fixes:
* MDEV-10004: Galera’s pc.recovery process fails in 10.1 with systemd
* MDEV-10145: Systemd fails to start mysqld in multi-instance mode

Some useful links:
* https://mariadb.com/kb/en/mariadb/systemd/

Reading MariaDB/MySQL binary log files in Go

While looking into a replication-related issue, I came across an interesting problem where I wanted to get the name of the largest event in a very large binary log file. So, I thought of putting Vaquita’s binary log API to a test. Here’s a simple program to read a binary log file and print the largest event and its size.

/*
  @File : largest_event.go
  @Description : Find the largest event in a binary log file.
  @Usage : ./largest_event.go -file=<binary log file>
*/

package main

import (
    "flag"
    "fmt"
    "github.com/vaquita/mysql"
    "os"
)

func main() {
    var (
        b         mysql.Binlog
        dsn       string
        eventName string
        eventSize uint32
        file      string
    )

    flag.StringVar(&file, "file", "/tmp/master-bin.000001",
        "Binary log file (absolute path)")
    flag.Parse()

    dsn = fmt.Sprint("file://", file)

    // Open a binary log handle
    if err := b.Connect(dsn); err != nil {
        fmt.Println(err)
        os.Exit(1)
    }

    // Start reading from the specified binary log file
    if err := b.Begin(); err != nil {
        fmt.Println(err)
        os.Exit(1)
    }

    // Close the binary log handle
    defer b.Close()

    // Iterate through the events
    for b.Next() {
                re, err := b.RawEvent()

                if err != nil {
                        fmt.Println(err)
                        os.Exit(1)
                }

        if re.Size() > eventSize {
            eventSize = re.Size()
            eventName = re.Name()
        }
    }

    fmt.Println("Event :", eventName)
    fmt.Println("Size  :", eventSize)
}

In this case, having RawEvents is sufficient as we are not interested in the actual content of the events.
Wiki : https://github.com/vaquita/mysql/wiki/Replication-Binary-Logs

Edit: Updated the above program to account for a recent change done to RawEvent(), which now returns error.

Capturing MySQL packets

While working on Vaquita, I spent a good time analyzing the TCP packets transferred between MySQL server and client. The analysis gave a good insight of the communication protocols used by MySQL. I used ngrep to perform the packet capturing. Its a simple, yet powerful tool.

$sudo ngrep -x -q -d lo port 16001
# where,
#   -x          : Dump packet contents as hexadecimal
#   -q          : Be quiet, do not print '#' for every packet received
#   -d <dev>    : Listen to 'dev' interface only
#   port <port> : Print packets with this source or destination port only

interface: lo (127.0.0.0/255.0.0.0)
filter: (ip or ip6) and ( port 16001 )

T 127.0.0.1:16001 -> 127.0.0.1:41499 [AP]
  5e 00 00 00 0a 35 2e 35    2e 35 2d 31 30 2e 31 2e    ^....5.5.5-10.1.
  38 2d 4d 61 72 69 61 44    42 2d 64 65 62 75 67 00    8-MariaDB-debug.
  07 00 00 00 5f 50 50 23    58 76 69 2e 00 ff f7 08    ...._PP#Xvi.....
  02 00 3f a0 15 00 00 00    00 00 00 00 00 00 00 45    ..?............E
  52 4a 5b 6a 25 75 3f 7b    6f 3b 74 00 6d 79 73 71    RJ[j%u?{o;t.mysq
  6c 5f 6e 61 74 69 76 65    5f 70 61 73 73 77 6f 72    l_native_passwor
  64 00                                                 d.              

T 127.0.0.1:41499 -> 127.0.0.1:16001 [AP]
  a7 00 00 01 0d a6 3f 20    00 00 00 01 21 00 00 00    ......? ....!...
  00 00 00 00 00 00 00 00    00 00 00 00 00 00 00 00    ................
  00 00 00 00 72 6f 6f 74    00 00 74 65 73 74 00 6d    ....root..test.m
  79 73 71 6c 5f 6e 61 74    69 76 65 5f 70 61 73 73    ysql_native_pass
  77 6f 72 64 00 65 03 5f    6f 73 05 4c 69 6e 75 78    word.e._os.Linux
  0c 5f 63 6c 69 65 6e 74    5f 6e 61 6d 65 08 6c 69    ._client_name.li
  62 6d 79 73 71 6c 04 5f    70 69 64 04 39 31 39 37    bmysql._pid.9197
  0f 5f 63 6c 69 65 6e 74    5f 76 65 72 73 69 6f 6e    ._client_version
  06 31 30 2e 31 2e 38 09    5f 70 6c 61 74 66 6f 72    .10.1.8._platfor
  6d 06 78 38 36 5f 36 34    0c 70 72 6f 67 72 61 6d    m.x86_64.program
  5f 6e 61 6d 65 05 6d 79    73 71 6c                   _name.mysql     

.. snip ..

T 127.0.0.1:41499 -> 127.0.0.1:16001 [AP]
  11 00 00 00 03 53 45 4c    45 43 54 20 56 45 52 53    .....SELECT VERS
  49 4f 4e 28 29                                        ION()           

T 127.0.0.1:16001 -> 127.0.0.1:41499 [AP]
  01 00 00 01 01 1f 00 00    02 03 64 65 66 00 00 00    ..........def...
  09 56 45 52 53 49 4f 4e    28 29 00 0c 21 00 3c 00    .VERSION()..!.<.
  00 00 fd 01 00 1f 00 00    05 00 00 03 fe 00 00 02    ................
  00 15 00 00 04 14 31 30    2e 31 2e 38 2d 4d 61 72    ......10.1.8-Mar
  69 61 44 42 2d 64 65 62    75 67 05 00 00 05 fe 00    iaDB-debug......
  00 02 00                                              ...        

Reference : https://dev.mysql.com/doc/internals/en/a-mysql-client-logs-in.html

Donor selection in a Galera cluster

In a Galera cluster, when a node joins or rejoins the cluster it needs to acquire the state in order to sync with the cluster. Galera implements a pretty decent algorithm to elect a node as donor of the state, aka Donor node. In this post, I will attempt to describe the algorithm in a simplified manner.
When a node joins the cluster, it first checks if it can receive the state via incremental state transfer (IST) instead of a full snapshot state transfer (SST). This can happen when the node was previously part of the cluster (with same group UUID), and its missing updates are still cached on at least one of the nodes in the cluster. If none of the node have required updates cached, it will fall back to looking for a suitable node for a full snapshot state transfer (SST).

Note: The following algorithm is derived from Galera v3.12. It may change in future versions.

Donor selection algorithm:

  1. First, try to find an IST donor (a node that has the joiner’s missing updates cached into its gcache) by looking for
    1. a SYNCED node from the wsrep-sst-donor list with the highest cache sequence number, else
    2. a SYNCED node which is
      1. not stateless (i.e. not garbd) and
      2. a local node (in same segment as joiner) with highest cache sequence number, else
      3. a remote node (in a different segment) with highest cache sequence number
    150917 21:48:06 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.2.15:4011
    150917 21:48:06 [Note] WSREP: Member 0.0 (my_node2) requested state transfer from '*any*'. Selected 1.0 (my_node1)(SYNCED) as donor.
    150917 21:48:06 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 10)
    150917 21:48:06 [Note] WSREP: Requesting state transfer: success, donor: 1
    150917 21:48:06 [Note] WSREP: 1.0 (my_node1): State transfer to 0.0 (my_node2) complete.
    
  2. Else, try to find an SST donor, by looking for
    1. a SYNCED node from the wsrep-sst-donor list, else
    2. a SYNCED node which is
      1. not stateless (i.e. not garbd) and
      2. a local node in same segment as joiner, else
      3. a remote node from a different segment
150917 21:42:21 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (925fb059-1b51-11e5-a295-7739e970a4a4): 1 (Operation not permitted)
     at galera/src/replicator_str.cpp:prepare_for_IST():456. IST will be unavailable.
150917 21:42:21 [Note] WSREP: Member 1.0 (my_node2) requested state transfer from '*any*'. Selected 0.0 (my_node1)(SYNCED) as donor.
150917 21:42:21 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 5)
150917 21:42:21 [Note] WSREP: Requesting state transfer: success, donor: 0
150917 21:42:23 [Note] WSREP: (8052c243, 'tcp://0.0.0.0:4010') turning message relay requesting off
150917 21:42:24 [Note] WSREP: 0.0 (my_node1): State transfer to 1.0 (my_node2) complete.
150917 21:42:24 [Note] WSREP: Member 0.0 (my_node1) synced with group.

Securing MariaDB Galera Cluster using SSL

In MariaDB Galera cluster, the data (writeset) is replicated across all the nodes over network. This data traffic can be secured by enabling SSL connection between the nodes. In order to achieve this, one needs to have SSL certificate/key pair (Here’s how to generate one). The certificate and key can then be copied to all the nodes. Once done, instruct the nodes to establish SSL connection on startup by simply pointing to the certificate/key files using wsrep_provider_options system variable.

wsrep_provider_options='socket.ssl_cert=/tmp/cert.pem;socket.ssl_key=/tmp/key.pem'

Following is the list of SSL options that Galera supports (details : Galera parameters:

  • socket.ssl : Enable/disable SSL, explicitly
  • socket.ssl_ca : SSL CA file
  • socket.ssl_cert : SSL certificate file
  • socket.ssl_cipher : SSL cipher list
  • socket.ssl_compression : Enable/disable SSL compression
  • socket.ssl_key : SSL key file
  • socket.ssl_password_file : SSL password file, in case the key is encrypted

At the time of writing, there is no way to check if galera connection is encrypted using SQL (issue#165). The only way is to look into the node’s error log for the following :

150516 14:22:03 [Note] WSREP: SSL handshake successful, remote endpoint ssl://127.0.0.1:46661 local endpoint ssl://127.0.0.1:4000 cipher: AES128-SHA compression: 
150516 14:22:03 [Note] WSREP: (6f49f928, 'ssl://0.0.0.0:4000') turning message relay requesting on, nonlive peers: 
150516 14:22:03 [Note] WSREP: declaring 73263ae6 at ssl://127.0.0.1:4010 stable

Lastly, it is important to note that SST (snapshot state transfer) traffic is not affected, whatsoever, by the use of galera SSL options.

Generating self-signed SSL certificate/key pair

Here, I present a simple command to generate a self signed SSL certificate/key pair that can be used to secure the communication channel between communicating parties.

$ openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days `echo "365 * 2" | bc` -nodes
Generating a 2048 bit RSA private key
.....................................+++
.....................................+++
unable to write 'random state'
writing new private key to 'key.pem'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:
Email Address []:

$ ls
cert.pem  key.pem

$ $ date
Mon May 11 15:18:24 EDT 2015

$ openssl x509 -noout -startdate -enddate -in cert.pem
notBefore=May 11 19:12:03 2015 GMT
notAfter=May 10 19:12:03 2017 GMT

Note: -nodes switch skips encryption of the key.

MariaDB 10.1 : A new version of GRA_X_X.log file

When a MariaDB Galera cluster node fails to apply a binary log (or writeset, as we call it), the node dumps it into a file (GRA_X_X.log) under the data directory for further investigation. This process has been explained fairly well in this Percona blog). Since the dumped log file is header-less, one has to first prepend a binary log header to it in order to open it using mysqlbinlog tool.

Starting MariaDB 10.1.4, the node will automatically prepend the binlog header to the writeset before dumping it into the GRA_ log file. In order to differentiate it from the older log files, the file has been renamed to GRA_X_X_v2.log.

$ ./bin/mysqlbinlog data2/GRA_1_1_v2.log

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150414 17:50:20 server id 0  end_log_pos 248  Start: binlog v 4, server v 10.1.4-MariaDB-wsrep-debug created 150414 17:50:20 at startup
ROLLBACK/*!*/;
BINLOG '
nIstVQ8AAAAA9AAAAPgAAAAAAAQAMTAuMS40LU1hcmlhREItd3NyZXAtZGVidWcAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAACciy1VEzgNAAgAEgAEBAQEEgAA3AAEGggAAAAICAgCAAAACgoKAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAEEwQAZxODxw==
'/*!*/;
# at 248
#150414 17:50:20 server id 0  end_log_pos 76   Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
SET TIMESTAMP=1429048220/*!*/;
SET @@session.pseudo_thread_id=4/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
drop table t1
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Password validation plugins in MariaDB

Let me start off with a phrase : “A chain is only as strong as its weakest link”. So, how to ensure that all the links (=passwords) are strong enough to keep the system secure? One of the key attributes to consider here is password strength. MariaDB 10.1.2 added support for password validation by introducing a password validation plugin API and two password validation plugins. These plugins can be used to ensure that the password used for the user accounts adhere to some required security standards.

  1. simple_password_check
  2. 
    MariaDB [test]> INSTALL SONAME 'simple_password_check';
    Query OK, 0 rows affected (0.04 sec)
    
    
    MariaDB [test]> SELECT VARIABLE_NAME, DEFAULT_VALUE, VARIABLE_COMMENT FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE VARIABLE_NAME LIKE 'SIMPLE_PASSWORD%'\G
    *************************** 1. row ***************************
       VARIABLE_NAME: SIMPLE_PASSWORD_CHECK_DIGITS
       DEFAULT_VALUE: 1
    VARIABLE_COMMENT: Minimal required number of digits
    *************************** 2. row ***************************
       VARIABLE_NAME: SIMPLE_PASSWORD_CHECK_LETTERS_SAME_CASE
       DEFAULT_VALUE: 1
    VARIABLE_COMMENT: Minimal required number of letters of the same letter case.This limit is applied separately to upper-case and lower-case letters
    *************************** 3. row ***************************
       VARIABLE_NAME: SIMPLE_PASSWORD_CHECK_OTHER_CHARACTERS
       DEFAULT_VALUE: 1
    VARIABLE_COMMENT: Minimal required number of other (not letters or digits) characters
    *************************** 4. row ***************************
       VARIABLE_NAME: SIMPLE_PASSWORD_CHECK_MINIMAL_LENGTH
       DEFAULT_VALUE: 8
    VARIABLE_COMMENT: Minimal required password length
    4 rows in set (0.00 sec)
    
    MariaDB [test]> SET PASSWORD = PASSWORD('abc');
    ERROR 1819 (HY000): Your password does not satisfy the current policy requirements
    
    
  3. cracklib_password_check (requires libcrack2)
  4. 
    MariaDB [test]> INSTALL SONAME 'cracklib_password_check';
    Query OK, 0 rows affected (0.05 sec)
    
    MariaDB [test]> SELECT VARIABLE_NAME, DEFAULT_VALUE, VARIABLE_COMMENT FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES WHERE VARIABLE_NAME LIKE 'CRACKLIB_PASSWORD%'\G
    *************************** 1. row ***************************
       VARIABLE_NAME: CRACKLIB_PASSWORD_CHECK_DICTIONARY
       DEFAULT_VALUE: /var/cache/cracklib/cracklib_dict
    VARIABLE_COMMENT: Path to a cracklib dictionary
    1 row in set (0.00 sec)
    
    MariaDB [test]> SET PASSWORD = PASSWORD('qwerty');
    ERROR 1819 (HY000): Your password does not satisfy the current policy requirements
    
    

    So, what passwords are now forbidden? Lets check :

    $ echo "nirbhay" | cracklib-check 
    nirbhay: it is based on your username
    $ echo "password" | cracklib-check 
    password: it is based on a dictionary word
    $ echo "qwerty" | cracklib-check 
    qwerty: it is based on a dictionary word
    $ echo "123" | cracklib-check 
    123: it is WAY too short
    $ echo "123456" | cracklib-check 
    123456: it is too simplistic/systematic
    $ echo "12345654321" | cracklib-check 
    12345654321: it is too simplistic/systematic
    

Password validation plugins only validate plain-text password (for obvious reasons!). So commands that contain password hashes are not validated. In order to reject such commands strict_password_validation system variable can be used. Lastly, it important to note that multiple password validation plugins can be loaded at the same time and the password must pass on all the plugins.

Understanding split brain in a Galera cluster

Split brain is a condition when a cluster gets partitioned and each part is operating independently. This is an unwanted situation that one would always want to avoid. So, how is it handled in a MariaDB Galera cluster? In order to understand this, lets first start by looking into the logs of a node from a split-brain(ed) cluster.

50217 15:49:02 [Note] WSREP: (dc9e817d-b6e5-11e4-a36f-8fe330cf395b, 'tcp://0.0.0.0:4010') turning message relay requesting on, nonlive peers: tcp://192.168.0.8:4030 tcp://192.168.0.8:4040
150217 15:49:03 [Note] WSREP: (dc9e817d-b6e5-11e4-a36f-8fe330cf395b, 'tcp://0.0.0.0:4010') reconnecting to 370a81b6-b6e6-11e4-b985-beb180620fd9 (tcp://192.168.0.8:4030), attempt 0
150217 15:49:03 [Note] WSREP: (dc9e817d-b6e5-11e4-a36f-8fe330cf395b, 'tcp://0.0.0.0:4010') reconnecting to 4ba33092-b6e6-11e4-93a4-ba232dff3484 (tcp://192.168.0.8:4040), attempt 0
150217 15:49:04 [Note] WSREP: evs::proto(dc9e817d-b6e5-11e4-a36f-8fe330cf395b, GATHER, view_id(REG,370a81b6-b6e6-11e4-b985-beb180620fd9,4)) suspecting node: 370a81b6-b6e6-11e4-b985-beb180620fd9
150217 15:49:04 [Note] WSREP: evs::proto(dc9e817d-b6e5-11e4-a36f-8fe330cf395b, GATHER, view_id(REG,370a81b6-b6e6-11e4-b985-beb180620fd9,4)) suspecting node: 4ba33092-b6e6-11e4-93a4-ba232dff3484
...
...
150217 15:49:14 [Note] WSREP: evs::proto(dc9e817d-b6e5-11e4-a36f-8fe330cf395b, GATHER, view_id(REG,370a81b6-b6e6-11e4-b985-beb180620fd9,4)) detected inactive node: 370a81b6-b6e6-11e4-b985-beb180620fd9
150217 15:49:14 [Note] WSREP: evs::proto(dc9e817d-b6e5-11e4-a36f-8fe330cf395b, GATHER, view_id(REG,370a81b6-b6e6-11e4-b985-beb180620fd9,4)) detected inactive node: 4ba33092-b6e6-11e4-93a4-ba232dff3484
150217 15:49:15 [Note] WSREP: view(view_id(NON_PRIM,370a81b6-b6e6-11e4-b985-beb180620fd9,4) memb {
        d40d3354-b6e5-11e4-aed5-8e8f043f2e91,0
        dc9e817d-b6e5-11e4-a36f-8fe330cf395b,0
} joined {
} left {
} partitioned {
        370a81b6-b6e6-11e4-b985-beb180620fd9,0
        4ba33092-b6e6-11e4-93a4-ba232dff3484,0
})
150217 15:49:15 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
150217 15:49:15 [Note] WSREP: Flow-control interval: [23, 23]
150217 15:49:15 [Note] WSREP: Received NON-PRIMARY.
150217 15:49:15 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 0)
150217 15:49:15 [Note] WSREP: New cluster view: global state: d40f1ba3-b6e5-11e4-91f1-1b1c88707f31:0, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3
150217 15:49:15 [Note] WSREP: Setting wsrep_ready to 0

In the event of a network partition, some nodes of the cluster may no longer be reachable from the other nodes. They try to reconnect to these suspecting nodes and later move them to partitioned list by marking them as inactive when no response is received. A voting for quorum is then taken on each node to see if they belong to the majority partition (Primary Component) using the following formula :

where,

  • : members of the last seen primary component,
  • : members that are known to have left gracefully,
  • : current components members, and
  • : member’s weight

In a Galera cluster, nodes outside the primary component are not allowed to process queries. It is mainly done preserve data consistency.

MariaDB [test]> select 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

Now, as shown in the logs above, when the cluster gets split into two partitions of equal size, (i.e. both the partitions get equal weight, split-brain), the quorum algorithm fails find the the primary component. As a result, the cluster has no primary component and can no longer process any queries. In this case, one has to find the node which has most recent updates and bootstrap the cluster using that node.

Reference: http://galeracluster.com/documentation-webpages/weightedquorum.html

New Information schema tables for Galera membership and status

MariaDB Galera server logs all the cluster related information like node status, cluster status, membership, etc. in the error log. MariaDB 10.1.2 introduces a new INFORMATION SCHEMA plugin WSREP_INFO that enables querying these information via INFORMATION SCHEMA tables. The WSREP_INFO plugin adds two new tables to the Information Schema, WSREP_MEMBERSHIP and WSREP_STATUS. The plugin is not enabled by default, so in order to use it, it needs to be installed first :

MariaDB [test]> INSTALL SONAME 'wsrep_status';
Query OK, 0 rows affected (0.04 sec)

MariaDB [test]> SHOW PLUGINS;
+-----------------------------+----------+--------------------+-----------------+---------+
| Name                        | Status   | Type               | Library         | License |
+-----------------------------+----------+--------------------+-----------------+---------+
...
| WSREP_MEMBERSHIP            | ACTIVE   | INFORMATION SCHEMA | wsrep_status.so | GPL     |
| WSREP_STATUS                | ACTIVE   | INFORMATION SCHEMA | wsrep_status.so | GPL     |
+-----------------------------+----------+--------------------+-----------------+---------+
52 rows in set (0.01 sec)

MariaDB [test]> SHOW CREATE TABLE INFORMATION_SCHEMA.WSREP_MEMBERSHIP\G
*************************** 1. row ***************************
       Table: WSREP_MEMBERSHIP
Create Table: CREATE TEMPORARY TABLE `WSREP_MEMBERSHIP` (
  `INDEX` int(11) NOT NULL DEFAULT '0',
  `UUID` varchar(36) NOT NULL DEFAULT '',
  `NAME` varchar(32) NOT NULL DEFAULT '',
  `ADDRESS` varchar(256) NOT NULL DEFAULT ''
) ENGINE=MEMORY DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

MariaDB [test]> SHOW CREATE TABLE INFORMATION_SCHEMA.WSREP_STATUS\G
*************************** 1. row ***************************
       Table: WSREP_STATUS
Create Table: CREATE TEMPORARY TABLE `WSREP_STATUS` (
  `NODE_INDEX` int(11) NOT NULL DEFAULT '0',
  `NODE_STATUS` varchar(16) NOT NULL DEFAULT '',
  `CLUSTER_STATUS` varchar(16) NOT NULL DEFAULT '',
  `CLUSTER_SIZE` int(11) NOT NULL DEFAULT '0',
  `CLUSTER_STATE_UUID` varchar(36) NOT NULL DEFAULT '',
  `CLUSTER_STATE_SEQNO` bigint(21) NOT NULL DEFAULT '0',
  `CLUSTER_CONF_ID` bigint(21) NOT NULL DEFAULT '0',
  `GAP` varchar(10) NOT NULL DEFAULT '',
  `PROTOCOL_VERSION` int(11) NOT NULL DEFAULT '0'
) ENGINE=MEMORY DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

MariaDB [test]> SELECT @@wsrep_provider;
+----------------------------------+
| @@wsrep_provider                 |
+----------------------------------+
| /usr/lib/galera/libgalera_smm.so |
+----------------------------------+
1 row in set (0.00 sec)

Now that WSREP_INFO plugin is installed, lets look into the contents of these tables on a 3-node cluster.

MariaDB [test]> SELECT * FROM INFORMATION_SCHEMA.WSREP_MEMBERSHIP;
+-------+--------------------------------------+----------+-----------------+
| INDEX | UUID                                 | NAME     | ADDRESS         |
+-------+--------------------------------------+----------+-----------------+
|     0 | 19058073-8940-11e4-8570-16af7bf8fced | my_node1 | 10.0.2.15:16001 |
|     1 | 19f2b0e0-8942-11e4-9cb8-b39e8ee0b5dd | my_node3 | 10.0.2.15:16003 |
|     2 | d85e62db-8941-11e4-b1ef-4bc9980e476d | my_node2 | 10.0.2.15:16002 |
+-------+--------------------------------------+----------+-----------------+
3 rows in set (0.00 sec)

MariaDB [test]> SELECT * FROM INFORMATION_SCHEMA.WSREP_STATUS\G
*************************** 1. row ***************************
         NODE_INDEX: 0
        NODE_STATUS: Synced
     CLUSTER_STATUS: Primary
       CLUSTER_SIZE: 3
 CLUSTER_STATE_UUID: 190604d7-8940-11e4-a084-ebee5211c190
CLUSTER_STATE_SEQNO: 2
    CLUSTER_CONF_ID: 3
                GAP: NO
   PROTOCOL_VERSION: 3
1 row in set (0.00 sec)

As seen above, WSREP_MEMBERSHIP table shows information about current members in the cluster which includes node’s name and incoming address. WSREP_STATUS table, on the other hand, shows status information about the node and cluster as a whole.

SHOW command can also be used to query these tables. Its quick and reduces the number of columns for WSREP_STATUS to fit to the screen.

MariaDB [test]> SHOW WSREP_MEMBERSHIP;
+-------+--------------------------------------+----------+-----------------+
| Index | Uuid                                 | Name     | Address         |
+-------+--------------------------------------+----------+-----------------+
|     0 | 19058073-8940-11e4-8570-16af7bf8fced | my_node1 | 10.0.2.15:16001 |
|     1 | 19f2b0e0-8942-11e4-9cb8-b39e8ee0b5dd | my_node3 | 10.0.2.15:16003 |
|     2 | d85e62db-8941-11e4-b1ef-4bc9980e476d | my_node2 | 10.0.2.15:16002 |
+-------+--------------------------------------+----------+-----------------+
3 rows in set (0.00 sec)

MariaDB [test]> SHOW WSREP_STATUS;
+------------+-------------+----------------+--------------+
| Node_Index | Node_Status | Cluster_Status | Cluster_Size |
+------------+-------------+----------------+--------------+
|          0 | Synced      | Primary        |            3 |
+------------+-------------+----------------+--------------+
1 row in set (0.00 sec)