When the source Redis instance fails, you need to perform a disaster recovery failover for the disaster recovery cluster first, and then switch the client's access address to the new disaster recovery instance.
In a disaster recovery cluster, when the source fails, the synchronization link with the failed source must be interrupted and the target side promoted to the source. This ensures that the client can write data and that dirty data from the failed source does not pollute the target instance.
When the source of the disaster recovery cluster fails, Alauda Cache Service for Redis OSS on the target side detects that the disaster recovery link with the source is interrupted. The status of the `ActiveRedisConnection` resource then becomes abnormal, and the Web Console of the target instance prompts that a disaster recovery switch can be performed.
When `STATUS` is `Failed`, the connection between instance `c6-dest` and the upstream is abnormal. You can view the YAML details of the resource to understand the specific situation.
After manually confirming that the source is indeed abnormal, you need to disconnect the target instance from the source's disaster recovery link to prevent the introduction of dirty data.
After the corresponding `ActiveRedisConnection` resource is deleted, the Redis instance becomes an independent source instance, and the client's access address can be safely switched to this instance for reading and writing.
After the failed source returns to normal, it can be re-added to the disaster recovery cluster as a target instance of the new source. For the operation method, refer to Setup Disaster Recovery.
Before connecting to the disaster recovery cluster, the client should support switching the Redis access address to the new source instance (the promoted target) after detecting the source failure. Usually, there are the following switching methods:
Architecture | Mechanism | Advantages | Disadvantages | Impact on RTO/RPO | Implementation Difficulty | Ideal Use Case |
---|---|---|---|---|---|---|
DNS Switch | Update DNS records to point to the new IP | Simple concept, platform-independent. | Long and uncontrollable RTO (depends on TTL); ineffective for K8s internal traffic. | RTO: minutes; RPO: may be high. | Low | Applications with relaxed RTO requirements, or as a manual fallback solution. |
Proxy | The client connects to the proxy, and the proxy routes to the master node based on health checks. | Fast switching (RTO in seconds), transparent to the client, simple fault recovery. | Adds a network hop and operational burden; the proxy itself must be highly available. | RTO: seconds; RPO: low. | Medium | Recommended solution for teams that require fast, transparent switching and can operate a highly available proxy cluster. |
Service Mesh (Istio) | The sidecar proxy intercepts traffic and performs local-priority routing and cross-cluster switching based on policies. | Powerful features, transparent to the application, simple fault recovery. | Extremely high operational complexity, heavy technology stack. | RTO: seconds; RPO: low. | High | Large, complex systems that have fully adopted a service mesh to manage microservices. |
Client Library | The library has built-in logic and decides on its own when to switch to the target cluster. | No middle layer, latency may be lower. | Extremely high risk: switching decisions are unreliable, fault recovery is complex (often requires restarting the application), and ecosystem support is inconsistent. | RTO: unpredictable; RPO: high risk. | Medium | Not recommended for production-level automatic disaster recovery. |
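
As a rough illustration of the health-check routing that the Proxy row describes, the following Python sketch picks the first healthy endpoint from a priority-ordered list. It is a minimal sketch under stated assumptions, not the behavior of any particular proxy: the names `redis_ping` and `choose_endpoint` are hypothetical, the probe uses a plain inline `PING` over TCP, and an instance that requires authentication would answer with an error and be treated as unhealthy here.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port); the values used are up to the caller


def redis_ping(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if the endpoint answers an inline PING with +PONG.

    Assumption: no AUTH or TLS is needed. Any error reply, timeout, or
    connection failure counts as unhealthy.
    """
    try:
        with socket.create_connection(endpoint, timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        return False


def choose_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool] = redis_ping,
) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

Listing the original source first and the promoted target second keeps traffic on the source while it is healthy. A single probe like this is only a routing primitive; it is not sufficient on its own to decide that a disaster recovery failover is warranted.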
Whether the client should trigger a disaster recovery failover cannot be decided from a single binary check; failover is a multi-dimensional, high-confidence decision. It must combine fault detection across multiple dimensions, including but not limited to: the availability status of the instance, whether the instance still has high availability, whether the k8s cluster can continue to serve, and data center availability detection.
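
One possible way to combine these dimensions is sketched below in Python. Everything in it is an illustrative assumption rather than a rule defined by the product: the `Signals` fields, the way they are combined in `source_is_lost`, and the requirement of several consecutive agreeing observations in `should_fail_over` are hypothetical choices that each deployment would need to adapt.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Signals:
    """One observation of the source side (all fields are hypothetical)."""
    instance_available: bool    # the source instance answers requests
    instance_has_ha: bool       # the source can still recover through its own HA
    cluster_can_serve: bool     # the source k8s cluster can keep serving
    datacenter_available: bool  # the source data center is reachable


def source_is_lost(s: Signals) -> bool:
    """Assumed rule: the instance is down and cannot recover locally."""
    if s.instance_available:
        return False
    can_recover_locally = (
        s.instance_has_ha and s.cluster_can_serve and s.datacenter_available
    )
    return not can_recover_locally


def should_fail_over(history: Sequence[Signals], required: int = 3) -> bool:
    """True only if the last `required` observations all report a lost source."""
    if required < 1 or len(history) < required:
        return False
    return all(source_is_lost(s) for s in history[-required:])
```

Requiring several consecutive observations trades a slightly longer RTO for fewer false failovers caused by a transient network blip.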
Usually, the following expression needs to be met before the client can safely perform a disaster recovery switch:
Alauda Cache Service for Redis OSS currently does not provide support for client-side disaster recovery switching. Customers need to implement a suitable client switching method according to their own infrastructure.