The curious case of lost data

There is no doubt that Galera makes a fine clustering solution. The ease at which one can setup a cluster is incredible. Recently, a query from a user on #mariadb@freenode, however, revealed a caveat which I feel is quite important and worth sharing.

Sep 10 14:16:09 <____> nirbhay: executive summary, my cluster was out of sync, and when i brought it back into sync the DB is now 3 weeks old. it updated the new node with the old node’s contents.
Sep 10 14:16:23 <____> nirbhay: galera + maria 5.5

Let me first try to explain the problem. The user has a MariaDB Galera cluster – up and running – but with stale data (3 weeks old to be precise) and another MariaDB Galera node running outside the cluster storing latest updates. Now, the user takes this node and adds to the cluster, thinking that out-of-sync cluster nodes would receive latest changes from this joining node. Unfortunately, it turned out that opposite happened – and this joiner node instead became stale (losing 3 weeks worth of updates!) when into synced with the cluster.

Actually, this is expected. When a node joins an existing Galera cluster, it receives data from one of the nodes (donor) of the cluster in order to get in sync with the cluster. During this process (aka snapshot state transfer) the node itself loses it own data (if any). This is precisely the reason why the joiner node (with latest updates) lost the recent changes in the above mentioned scenario.

So, how to handle such situations? In cases like this, one should discard the existing stale cluster and bootstrap a new cluster using the node having latest updates.