RabbitMQ Mnesia Database Exception Handling

Common Mnesia Database Failures

RabbitMQ uses the Mnesia database to store metadata such as queues, exchanges, and bindings. Common Mnesia failures fall into two categories:

  • Permission issues: the user running RabbitMQ lacks write permission on the MNESIA_BASE directory.
  • Table read failures: Mnesia cannot read its tables.

Permission Issues

When you encounter permission issues with the Mnesia database, grant the user that runs RabbitMQ write permission on the MNESIA_BASE directory to resolve the problem.
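For example, on a typical Linux installation where the data directory is /var/lib/rabbitmq/mnesia and the broker runs as the rabbitmq user (both are assumptions about the environment; check the RABBITMQ_MNESIA_BASE environment variable for the actual path), the permissions could be fixed as follows:

# Assumed defaults: data dir /var/lib/rabbitmq/mnesia, service user rabbitmq
sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/mnesia
sudo chmod -R u+rwX /var/lib/rabbitmq/mnesia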

Mnesia Table Read Failure

Mnesia creates the corresponding database schema based on the machine's hostname. Therefore, when the hostname changes, Mnesia cannot load the old schema. Similarly, if the rabbit@hostname directory is renamed, Mnesia will also be unable to locate the old database files, prompting it to create a new rabbit@hostname folder and start the database anew.
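As a quick check, you can compare the machine's hostname with the schema directories that already exist under the Mnesia base directory (the path below is the common default and may differ in your installation):

# The hostname determines the rabbit@hostname node name
hostname
# Existing schema directories; if none matches the current node name, RabbitMQ creates a new one
ls /var/lib/rabbitmq/mnesia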

To fix Mnesia read failures, you can follow these steps:

  1. First, check whether the node's hostname or the rabbit@hostname directory name has changed. If it has, rename the directory back to the “rabbit@hostname” convention so that it matches the current node name; after renaming, Mnesia can see the old database files again (see the example after this list).

  2. If nothing has changed and the node runs in cluster mode, startup may fail because the node cannot connect to its peers. Check the logs for inter-node communication errors.

  3. If the above steps still do not restore the data, you can back up the current cluster's configuration, redeploy new nodes to join the original cluster, and then import the backed-up configuration (a sketch follows this list).
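For step 1, a minimal sketch assuming the rabbit@hostname directory was renamed by mistake and the node name is rabbit@myhost (hypothetical name, default data directory):

# With the RabbitMQ node stopped, restore the directory name Mnesia expects
mv /var/lib/rabbitmq/mnesia/rabbit@myhost-old /var/lib/rabbitmq/mnesia/rabbit@myhost
# Start the node again; the old queues, exchanges, and bindings should be visible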
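For step 3, one way to back up and restore the configuration is to export the broker definitions (users, vhosts, queues, exchanges, bindings, policies) and re-import them after the new nodes have joined; note that definitions do not include message contents. The commands below are a sketch: export_definitions and import_definitions require a reasonably recent RabbitMQ version, and rabbit@node1 is a hypothetical peer name.

# On a node that still holds the configuration: export definitions to a JSON file
rabbitmqctl export_definitions /tmp/definitions.json
# On each freshly deployed node: reset it and join it to the original cluster
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app
# Finally, import the backed-up definitions on any cluster member
rabbitmqctl import_definitions /tmp/definitions.json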

If RabbitMQ fails to form a new cluster properly after a restart, you may encounter errors in the log like the following:

[warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@e2e-rabbitmq-server-2.e2e-rabbitmq-nodes.local-midautons','rabbit@e2e-rabbitmq-server-1.e2e-rabbitmq-nodes.local-midautons','rabbit@e2e-rabbitmq-server-0.e2e-rabbitmq-nodes.local-midautons'],[rabbit_durable_queue]}
[info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 7 retries left

This issue occurs because cluster membership information is stored in the Mnesia database and is not cleared when RabbitMQ restarts, so each node waits for the peers it last knew about. Because the Pods are started sequentially, the first Pod to come up may wait indefinitely for Pods that have not started yet, creating a deadlock. The recommended workaround is to force the stuck node to boot without waiting for its cluster peers. You can run the following command with kubectl:

kubectl -n {namespace name} exec {instance name} -- rabbitmqctl force_boot
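For illustration, using the names that appear in the log above (which suggest Pod e2e-rabbitmq-server-0 in namespace local-midautons; substitute your own namespace and Pod name):

kubectl -n local-midautons exec e2e-rabbitmq-server-0 -- rabbitmqctl force_boot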