
Handling RabbitMQ Network Partition in AWS


5 SEP, 2019

RabbitMQ clusters do not tolerate network partitions well. However, sometimes accidents happen. This post will help you understand how to detect network partitions, some of the bad effects that may happen during partitions, and how to recover from them. RabbitMQ stores information about queues, exchanges, bindings, etc. in Erlang’s distributed database, Mnesia. Many of the details of what happens around network partitions are related to Mnesia’s behaviour.

Detecting Network Partitions

Mnesia will typically determine that a node is down if another node is unable to contact it for a minute. If two nodes come back into contact, both having thought the other is down, Mnesia will determine that a partition has occurred. This will be written to the RabbitMQ log in a form like:

=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network,
hare@smacmullen}

RabbitMQ nodes will record whether this event has ever occurred while the node is up, and expose this information through rabbitmqctl cluster_status and the management plugin.

rabbitmqctl cluster_status will normally show an empty list for partitions:

rabbitmqctl cluster_status
# => Cluster status of node rabbit@smacmullen …
# => [{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
# => {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
# => {partitions,[]}]
# => …done

The management plugin API will return partition information for each node under partitions in /api/nodes. The management plugin UI will show a large red warning on the overview page if a partition has occurred.
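The same information can be queried directly from the HTTP API. A minimal sketch, assuming the management plugin listens on the default port 15672, the default guest credentials are usable, and jq is available to trim the output to the relevant fields:

curl -s -u guest:guest http://localhost:15672/api/nodes | jq '.[] | {name, partitions}'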

Automatically handling partitions

By default, RabbitMQ uses the ignore policy for handling network partitions, in which case manual judgment needs to be exercised by the cluster admin. Apart from this, there are 3 modes in which RabbitMQ can automatically handle network partitions.

1. Pause Minority

In the pause-minority mode, RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. equal to or fewer than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts and will start again when the partition ends.
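As a minimal sketch, this mode is enabled with a single key in rabbitmq.conf (the new-style configuration format); the default value of this key is ignore:

cluster_partition_handling = pause_minority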

2. Pause-If-All-Down

In the pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B and the link between racks is lost, the pause-minority mode will pause all nodes. In the pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause.
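A minimal sketch of this mode in rabbitmq.conf, assuming the two rack-A nodes are named rabbit@rack-a-1 and rabbit@rack-a-2 (hypothetical names); the recover key controls what happens once the partition heals (ignore or autoheal):

cluster_partition_handling = pause_if_all_down
cluster_partition_handling.pause_if_all_down.recover = ignore
cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@rack-a-1
cluster_partition_handling.pause_if_all_down.nodes.2 = rabbit@rack-a-2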

3. Autoheal

In the autoheal mode, RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike the pause_minority mode, therefore, it takes effect when a partition ends rather than when one starts. The winning partition is the one which has the most clients connected (or, if this produces a draw, the one with the most nodes; and if that still produces a draw, one of the partitions is chosen in an unspecified way).
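Autoheal uses the same configuration key; a minimal sketch in rabbitmq.conf:

cluster_partition_handling = autoheal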

Recommended Mode For AWS

For RabbitMQ hosted in AWS with nodes evenly distributed across AZs, it is recommended to use the pause_minority mode. In this mode, during a network partition RabbitMQ pauses all the nodes that end up on the smaller side. The partition divides the cluster into different islands of nodes, so a cluster of size three witnessing a network partition might split into two islands: Island1 with 2 nodes and Island2 with one. The pause_minority algorithm then pauses all the nodes in Island2 while keeping the Mnesia service running on them to check cluster communication. The nodes are kept paused until the network is completely restored.

The pause_minority mode tries to keep up the consistency and availability of the cluster during the network partition. At any point in time, at least one node will always be running and serving traffic. This strategy also tries to ensure that the split-brain problem is not seen in the cluster, which helps us avoid the case where a publisher and a consumer talk to different nodes during the network partition. So if the RabbitMQ client in our applications has a failover strategy of switching between RabbitMQ nodes, the application might never even notice that a cluster partition has occurred.
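For illustration only, here is a minimal sketch of such a failover strategy using the Python pika client (the post does not name a specific client library; the hostnames and queue name below are hypothetical). pika's BlockingConnection accepts a list of connection parameters and tries them in order until one node accepts the connection:

import pika

# Hypothetical hostnames for the three cluster nodes spread across AZs.
node_hosts = ["rabbit-az1.internal", "rabbit-az2.internal", "rabbit-az3.internal"]

# One set of connection parameters per node. BlockingConnection accepts a
# list and tries each entry in order, so if the node we usually talk to has
# been paused by pause_minority the client fails over to a surviving node.
params = [pika.ConnectionParameters(host=h) for h in node_hosts]

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="demo-queue", durable=True)
channel.basic_publish(exchange="", routing_key="demo-queue", body=b"hello")
connection.close()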

POC Outcome Of Pause_Minority Mode

While testing pause_minority mode there were 2 scenarios that I discovered:

  1. When the master node was part of a network partition
  2. When only slave nodes were part of a network partition

Test Setup

  1. Enable pause_minority on all the nodes of the cluster.
  2. Simulate a cluster partition by dropping traffic between any two of the nodes, for example with iptables as sketched below.
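The exact command used in the POC is not reproduced here; a common way to sketch this with iptables (run as root on one node, with <other-node-ip> replaced by the peer's address) is:

# Drop all traffic to and from the other node to simulate a partition
iptables -A INPUT -s <other-node-ip> -j DROP
iptables -A OUTPUT -d <other-node-ip> -j DROP

# Delete the same rules later to heal the partition
iptables -D INPUT -s <other-node-ip> -j DROP
iptables -D OUTPUT -d <other-node-ip> -j DROP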

Scenario 1: When the master node was part of a network partition

  • Consider a cluster with 3 nodes

  • Network jitter occurs between the Master node and Slave1 node:

  • Master sees that it is not able to communicate with Slave1, but Slave2 reports to the Master that it can still see Slave1. The Master marks Slave1 as an inconsistent node, overrules the status response from Slave2, and considers it to be reporting the wrong status of Slave1. Therefore it also marks Slave2 as an inconsistent node. The cluster gets divided into 2 islands:

  • As a result, 2 islands are created: Island1 has the Master node and Island2 has both the slaves. Although Island1 has the master node, the pause_minority mode will choose to pause the Island1 nodes, as it has fewer nodes than Island2. Re-election of a new master will occur in Island2. The cluster will still be functioning, but with 2 nodes.

  • Now consider the case where the cluster in Island2, with 2 nodes, is running fine with the newly elected master, but the jitter in the network continues. Here we will see a special case where Slave2 reports to the new master that it can still see the old master node, yet the new master gets errors when trying to connect to the old master. Therefore the newly elected master will also ignore the status response of Slave2 and will mark it down.

  • This will cause the creation of a new island of nodes in the cluster. pause_minority will pause the island with fewer nodes; in this case Island3 will be paused, as it contains only the slave node. Once we restore the network link between the old master and the new master, the whole cluster will recover automatically.

Scenario 2: When only slave nodes were part of a network partition

  • Again, consider a cluster with 3 nodes:

  • Network jitter occurs between Slave1 and Slave2

In this case also, the cluster will be divided into 2 islands, but this time the master node will be part of the winning partition. Therefore pause_minority will pause one of the slave nodes, and the cluster will keep functioning as it is with 2 nodes. This time re-election will not occur, as the master was part of the winning partition.

  • Here the Master node will see that it is able to connect with Slave1, but Slave2 reports to the Master that it cannot see the Slave1 node. The Master will mark Slave1 as down and will wait for Slave2 to recover its network connectivity. Once we restore the link between Slave1 and Slave2, Slave1 will be added back to the cluster.

Author Bio

Vaibhav Mani Tripathi is currently working at Tavisca Solutions as a Senior Cloud Engineer. His professional experience includes AWS, Docker, ECS, Kubernetes, and automation using Chef and Terraform. He also has hands-on experience with ELK (Elasticsearch, Logstash and Kibana), Grafana, Redis, Cassandra, RabbitMQ, Consul and Aerospike, and has deep knowledge of their clustering behaviour.