Best Practices Guide
TOC
- Overview
- Architecture Selection
  - Sentinel Mode
  - Cluster Mode
  - Selection Guide
- Version Selection
- Resource Planning
  - Kernel Tuning
  - Memory Specifications
    - Why Limit to 8GB?
    - Memory Configuration Best Practices
  - CPU Resources
  - Multi-threading
  - Storage Planning
    - Capacity Planning
    - Performance Requirements
- Parameter Configuration
  - Built-in Templates
  - Parameter Update
  - Modification Examples
- Resource Specs
  - Sentinel Mode Specs
  - Cluster Mode Specs
- Scheduling
  - Node Selection
  - Taint Toleration
  - Anti-Affinity
    - Cluster Mode
    - Sentinel Mode
- User Management
  - Permission Profiles
  - Security Mechanisms
  - System Account
  - Production Best Practices
- Client Access
  - Topology Discovery
    - Sentinel Mode
    - Cluster Mode
  - Network Access Strategies
    - Sentinel Mode
    - Cluster Mode
  - Code Examples
  - Client Reliability Best Practices
- Observability & Operations
  - Backup & Security
  - Upgrade & Scaling
    - Upgrade
    - Scaling Notes
  - Monitoring
    - Built-in Metrics
    - Key Metrics & Alert Recommendations
- Troubleshooting
- References
Overview
As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including persistence stability, dynamic network topology changes, and resource isolation and scheduling.
This Best Practices document aims to provide a standardized reference guide for Redis deployments in production environments. It covers the full lifecycle management from architecture selection, resource planning, client integration to observability and operations. By following this guide, users can build an enterprise-class Redis data service that is High Availability (HA), High Performance, and Maintainability.
Architecture Selection
The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements:
Sentinel Mode
Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.
Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails.
- Pros: Simple architecture, mature operations, lower requirements for client protocols.
- Cons: Write capacity is limited to a single node; storage capacity cannot scale horizontally.
Cluster Mode
Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.
Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance.
- Pros: True high availability distributed storage, supports dynamic Resharding.
- Cons: Complex client protocol; specific multi-key commands (e.g., MGET) are restricted by Slot distribution.
Selection Guide
When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity.
Recommendations:
- If data volume is small (fits in single node memory) and simplicity/stability is priority, Sentinel Mode is preferred.
- If data volume is massive or write pressure is extremely high and cannot be supported by a single node, choose Cluster Mode.
Version Selection
Alauda Cache Service for Redis OSS currently supports 5.0, 6.0, and 7.2 stable versions. All three versions have undergone complete automated testing and production verification.
For new deployments, we strongly recommend choosing Redis 7.2:
- Lifecycle
  - 5.0/6.0: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications.
  - 7.2: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come.
- Compatibility
  - Redis 7.2 maintains high compatibility with 5.0 and 6.0 data commands. Most business code can migrate smoothly without modification.
  - Note: The RDB persistence file format (v11) is not backward compatible (i.e., an RDB file generated by 7.2 cannot be loaded by 6.0), but this does not affect new services.
- Key Features
- ACL v2: Provides granular access control (Key-based permission selectors), significantly enhancing security in multi-tenant environments.
- Redis Functions: Introduces Server-side Scripting standards, resolving issues with Lua script loss and replication, keeping logic closer to data.
- Sharded Pub/Sub: Resolves network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding.
- Performance Optimization: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency.
For more details on Redis 7.2 features, please refer to the official Redis 7.2 Release Notes.
Resource Planning
Kernel Tuning
To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level:
- Memory Allocation (vm.overcommit_memory)
  - Recommended: 1
  - Explanation: Setting to 1 (Always) ensures the kernel allows memory allocation during Redis Fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors.
- Connection Queue (net.core.somaxconn)
  - Recommended: 2048 or higher
  - Explanation: The Redis default tcp-backlog is 511. In high-concurrency scenarios, the system net.core.somaxconn should be increased to avoid dropping client connection requests.
- Transparent Huge Pages (THP)
  - Action: Disable (never)
  - Explanation: THP causes significant latency spikes during memory allocation in Redis, especially during Copy-on-Write (CoW) after Fork. It is recommended to disable THP on the host or via startup scripts.
Memory Specifications
Redis uses a snapshot mechanism to asynchronously persist in-memory data to disk for long-term storage. This keeps Redis high-performing but carries a risk of data loss between snapshots.
In Kubernetes containerized environments, we recommend a tiered memory management strategy:
- ✅ Standard Specs (< 8GB): Strongly Recommended. Ensures extremely low Fork latency and fast failure recovery (RTO < 60s); the most robust production choice.
- ⚠️ High-Performance Specs (8GB - 16GB): Acceptable. Requires high-performance host and THP must be disabled. Fork is controllable but may cause ~100ms jitter under high load.
- ❌ High-Risk Specs (> 16GB): Not Recommended. Single point of failure impact is too large, and full synchronization can easily saturate network bandwidth. Recommend horizontal splitting into Cluster mode.
Why Limit to 8GB?
While single instances on physical machines often run 32GB+, the 8GB limit in cloud-native environments is grounded in the following core technical constraints:
- Fork Blocking & Page Table Copy
  - Redis calls fork() during RDB/AOF Rewrite. Although memory pages are Copy-on-Write (CoW), the process page tables must be fully copied, blocking the main thread.
  - Estimation: 10GB memory ≈ 20MB page table ≈ 10~50ms blocking (depending on virtualization overhead). Exceeding 8GB increases blocking risk significantly, impacting the SLA.
- Failure Recovery Efficiency (RTO)
  - A container restart that loads an RDB file is a single-threaded, CPU-bound task (object deserialization). Tests show loading 8GB of data takes 30-50s (even with SSD). A 32GB instance could take several minutes to start, contradicting the Kubernetes "fast self-healing" philosophy.
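The page-table estimate above can be reproduced directly: with 4 KiB pages and 8-byte page-table entries (typical x86-64 values), 10GB of resident memory yields about 20MB of page tables to copy at fork time.

```python
def fork_page_table_mib(memory_gib: float, page_kib: int = 4, pte_bytes: int = 8) -> float:
    """Estimate the page-table size (MiB) copied during fork().

    One pte_bytes-byte entry is needed per page_kib-KiB resident page.
    """
    pages = memory_gib * 1024 * 1024 / page_kib        # number of resident pages
    return pages * pte_bytes / (1024 * 1024)           # total entry bytes, in MiB
```

For a 10GiB instance this gives 20 MiB, matching the "10GB ≈ 20MB page table" rule of thumb.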
Memory Configuration Best Practices
To avoid the container being OOM-killed during persistence due to memory expansion, strict adherence to these principles is required:
- Set MaxMemory: Do not set maxmemory to 100% of the container Memory Limit. Recommend setting it to 70% ~ 80% of the Limit.
- Reserve CoW Space: Redis forks a child process during RDB/AOF Rewrite. Under heavy write load, the OS Copy-on-Write mechanism duplicates memory pages; in extreme cases, memory usage can double from 8GB to 16GB.
- Overcommit Config: Ensure the host has vm.overcommit_memory = 1 so the kernel allows forks without requesting equivalent physical memory (relying on CoW), preventing fork failures.

Resource Reservation Formula: Container_Memory_Limit ≈ Redis_MaxMemory / 0.7
- Example: To store 8GB of data, configure the Container Memory Limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead.
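The reservation formula is trivial to encode, which is useful when sizing many instances at once:

```python
def container_memory_limit_gib(maxmemory_gib: float, ratio: float = 0.7) -> float:
    """Container_Memory_Limit ≈ Redis_MaxMemory / 0.7 (formula from this guide)."""
    return maxmemory_gib / ratio
```

For maxmemory = 8GB this yields ~11.4GB, consistent with the recommended 10GB ~ 12GB limit.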
CPU Resources
Redis core command execution is single-threaded, but persistence (Fork) and other operations require child processes. Therefore, allocate at least 2 Cores per Redis instance:
- Core 1: Handles main thread requests and commands.
- Core 2: Handles persistence fork, background tasks, and system overhead.
Multi-threading
Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks.
- When to Enable?
  - Bottleneck Analysis: When Redis CPU usage nears 100% and analysis shows time spent on kernel-state network I/O (System CPU) rather than user-space command execution.
  - Traffic Profile: Typically beneficial when single-instance QPS > 80,000 or network traffic is huge (> 1GB/s).
  - Resource Conditions: Ensure the node has sufficient CPU cores (at least 4).
- Configuration Best Practices:
  - Thread Count: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gains.
  - Config Example: io-threads 4 (add io-threads-do-reads yes to also parallelize reads; writes are threaded by default once io-threads > 1).
  - Note: Multi-threaded I/O only improves network throughput; it does NOT improve execution speed of single complex commands (e.g., SORT, KEYS).
Storage Planning
Capacity Planning
Persistence mode directly determines disk quota requirements. Refer to the following calculation formula:
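As an illustration (the multipliers here are assumptions, not platform-mandated values): a common rule of thumb is roughly 2x the data size for RDB (current snapshot plus the temporary file written during BGSAVE) and 3x for AOF (append log growth plus rewrite working space) — consistent with the 5x reserve this guide recommends when both modes run together.

```python
def disk_quota_gib(data_gib: float, persistence: str) -> float:
    """Rough PVC sizing sketch. Multipliers are assumed rules of thumb:
    RDB ~2x (snapshot + temp file during BGSAVE),
    AOF ~3x (append log + rewrite headroom),
    diskless needs no persistence volume."""
    multipliers = {"diskless": 0.0, "rdb": 2.0, "aof": 3.0}
    return data_gib * multipliers[persistence]
```

For example, an 8GB dataset would plan ~16GB of disk under RDB and ~24GB under AOF.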
Performance Requirements
- With AOF: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when appendfsync everysec).
- Media: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks.
Parameter Configuration
Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields.
Built-in Templates
Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance.
<version> represents the Redis version, e.g., 6.0, 7.2.
Key parameter differences:
Persistence Selection Recommendations
- Pure Cache: Choose Diskless Template. Data rebuildable, no overhead, best performance.
- General Business: Choose RDB Template. Periodic snapshots provide minute-level RPO, moderate resource usage.
- Financial/High-Reliability: Choose AOF Template with appendfsync everysec for second-level protection.
Redis supports running RDB and AOF together, but it is generally not recommended in Kubernetes:
- Performance: AOF fsync creates IO pressure; adding RDB fork + disk write significantly increases resource contention.
- Storage Doubling: Requires space for both RDB snapshots and AOF files, complicating PVC planning.
- Recovery Priority: Redis loads AOF first on start (more complete data); RDB acts only as backup, offering limited benefit.
- Platform Backup: Alauda Cache Service for Redis OSS provides independent auto/manual backup, removing reliance on RDB snapshots for extra insurance.
Recommendation: Choose Single Persistence Mode (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space.
Parameter Update
Redis parameters are categorized by application method:
Always back up data before modifying parameters that require a restart.
Modification Examples
Update Data Node Parameters: Configure via spec.customConfig.
Update Sentinel Node Parameters: Configure via spec.sentinel.monitorConfig.
Currently supported: down-after-milliseconds, failover-timeout, parallel-syncs.
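The two update paths above can be sketched in a single CR. Hedged: the apiVersion, kind, and parameter values shown are illustrative assumptions; only the spec.customConfig and spec.sentinel.monitorConfig field paths come from this guide.

```yaml
apiVersion: redis.middleware.alauda.io/v1   # assumed group/version
kind: Redis                                 # assumed kind
metadata:
  name: my-redis
spec:
  customConfig:                  # data node parameters
    maxmemory-policy: allkeys-lru
  sentinel:
    monitorConfig:               # sentinel parameters (supported keys listed above)
      down-after-milliseconds: "30000"
```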
Resource Specs
Deploy resources according to your actual business scenario.
Sentinel Mode Specs
Cluster Mode Specs
<version> represents the Redis version, e.g., 6.0, 7.2.
Scheduling
Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments.
Node Selection
You can use the spec.nodeSelector field to specify which nodes Redis Pods should be scheduled on. This is typically used with Kubernetes Node Labels to isolate database workloads to dedicated node pools.
Persistence Limitation: If your Redis instance mounts Non-Network Storage (e.g., Local PV) PVCs, be cautious when updating nodeSelector. Since local data resides on specific nodes and cannot migrate with Pods, the updated nodeSelector set MUST include the node where the Pod currently resides. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction.
Taint Toleration
Use spec.tolerations to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., key=redis:NoSchedule), preventing other non-critical workloads from preempting resources.
Anti-Affinity
To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode.
Immutable: To ensure consistency and reliability, anti-affinity configurations (both affinityPolicy and affinity) cannot be modified after instance creation. Please plan ahead.
Cluster Mode
In Cluster mode, the system prioritizes spec.affinityPolicy. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet.
- Priority: spec.affinityPolicy > spec.affinity.
- If affinityPolicy is unset: Alauda Cache Service for Redis OSS checks spec.affinity. If you need custom topology rules beyond the enums below, leave affinityPolicy empty and configure native spec.affinity.
Sentinel Mode
Important: Sentinel Mode does not support spec.affinityPolicy.
For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules:
- Redis Data Nodes: Configured via spec.affinity.
- Sentinel Nodes: Configured via spec.sentinel.affinity.
You need to manually write complete Affinity rules. Example for forcing anti-affinity for both Data and Sentinel nodes:
To force anti-affinity across ALL nodes (Data + Sentinel), refer to:
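The referenced rules can be sketched with native Kubernetes pod anti-affinity. The label keys and values below are placeholders; the actual selector labels depend on the Pod labels the operator applies.

```yaml
spec:
  affinity:                      # Redis data nodes
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-redis             # placeholder label
          topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:                    # Sentinel nodes
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-redis-sentinel  # placeholder label
            topologyKey: kubernetes.io/hostname
```

Using requiredDuringSchedulingIgnoredDuringExecution enforces hard spreading across hosts; Pods will stay Pending if insufficient distinct nodes exist.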
User Management
Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via RedisUser CRD, supporting ACLs.
Compatibility: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control.
Permission Profiles
The platform pre-defines permission profiles for common scenarios:
For custom ACLs, see Redis ACL Documentation.
Security Mechanisms
- ACL Force Revocation: All RedisUser creations/updates undergo Webhook validation that forcibly removes acl permissions, preventing privilege escalation.
- Cluster Command Injection: For Cluster Mode, Alauda Cache Service for Redis OSS automatically injects topology commands (cluster|slots, cluster|nodes, cluster|info, cluster|keyslot, cluster|getkeysinslot, cluster|countkeysinslot) to ensure client topology awareness.
- 6.0 -> 7.2 Upgrade Compatibility: When upgrading from 6.0 to 7.2, the operator adds the &* (Pub/Sub Channel) permission to ensure consistency with 7.x's new Channel ACLs.
System Account
Each Redis instance automatically generates a system account named operator. Its roles include:
- Cluster Init: Slot assignment, node joining.
- Config Simplification: Unified system account reduces user configuration complexity.
- Operations: Used for health checks, failovers, scaling.
- Avoid Restarts: Password updates for business users don't affect this account, avoiding restarts.
- Complexity: Random 64-char string (alphanumeric+special).
- Privilege: Highest level (includes user management).
- Restriction: The password cannot be updated online. DO NOT manually modify or delete this account, as doing so may cause irreversible failure.
Production Best Practices
- App Isolation: Create independent user accounts for each app/microservice. Avoid sharing accounts to enable auditing and isolation.
- Principle of Least Privilege:
  - Read-Only App: Use ReadOnly.
  - Read-Write App: Use ReadWrite.
  - Ops Tools: Use NotDangerous or custom permissions.
  - Avoid Administrator: Unless absolutely necessary.
- Key Namespace Isolation: Combine ACL Key patterns (e.g., ~app1:*) to restrict apps to specific key prefixes.
- Password Rotation: Establish mechanisms to regularly rotate app passwords.
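Key-prefix isolation combined with least privilege can be expressed directly in Redis ACL syntax; a minimal sketch (the user name and password are placeholders):

```
ACL SETUSER app1 on >app1-password ~app1:* +@read +@write -@dangerous
```

This grants app1 read/write access restricted to keys prefixed app1:, while excluding the @dangerous command category (FLUSHALL, CONFIG, etc.).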
For operation steps, see User Management Docs.
Client Access
Topology Discovery
Both Sentinel and Cluster modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes:
Sentinel Mode
- Client connects to a Sentinel Node.
- Client sends SENTINEL get-master-addr-by-name mymaster to get the Master IP/Port.
- Client directly connects to the Master.
- On failover, Sentinel notifies the client (or the client polls) to switch to the new Master.
Cluster Mode
- Client connects to any Cluster Node.
- Client sends CLUSTER SLOTS / CLUSTER NODES to get the Slot distribution.
- Client calculates the hash slot for each Key and directly connects to the target node.
- If a slot migrates, the node returns MOVED/ASK; the client must refresh its topology.
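The slot calculation in step 3 needs no server: the Redis Cluster specification defines it as CRC16 (XModem variant) of the key modulo 16384, with hash tags ({...}) forcing related keys onto the same slot. A self-contained sketch:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), as specified by the Redis Cluster spec."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots, honoring hash tags."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # non-empty {tag}: hash only the tag
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Keys sharing a hash tag (e.g., {user1000}.following and {user1000}.followers) land on the same slot, which is what makes multi-key commands like MGET work on them in Cluster mode.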
Both protocols return real node IPs. If a reverse proxy (HAProxy/Nginx) is used, clients still receive backend real IPs, which may be unreachable from outside the cluster. Thus, each Redis Pod needs an independent external address (NodePort/LoadBalancer), not a single proxy address.
Network Access Strategies
Alauda Cache Service for Redis OSS supports multiple access methods:
Sentinel Mode
Cluster Mode
- Port Management: The NodePort range is limited (30000-32767); port conflicts are likely when running multiple instances.
- Security: Exposing node ports increases the attack surface.
- Multi-NIC: Redis binds the default NIC; clients may fail to connect if IPs mismatch.
- No LB Proxy: Sentinel/Cluster protocols require direct node connections and cannot be proxied by standard load balancers.
- Sentinel (1P1R + 3 Sentinels): Needs 8 NodePorts/LBs.
- Cluster (3 Shards x 1P1R): Needs 7 NodePorts/LBs.
Code Examples
We provide best practice examples for go-redis, Jedis, Lettuce, and Redisson:
- Sentinel Access: How to Access Sentinel Instance
- Cluster Access: How to Access Cluster Instance
Master Group Name: In Sentinel mode, the master name is fixed to mymaster.
Client Reliability Best Practices
- Timeouts
  - Connect Timeout: Distinct from Read Timeout. Recommend 1-3s.
  - Read/Write Timeout: Based on SLA, usually hundreds of milliseconds.
- Retry Strategy
  - Exponential Backoff: Do not retry immediately on failure; use backoff (100ms, 200ms, ...) to avoid retry storms.
- Connection Pooling
  - Reuse: Always use pooling (JedisPool, go-redis Pool) to save handshake costs.
  - Max Connections: Set MaxTotal reasonably to avoid hitting the Redis maxclients limit.
- Topology Refresh (Cluster)
  - Auto-refresh: Ensure the client enables MOVED/ASK handling.
  - Periodic refresh: In unstable or scaling environments, configure periodic refresh (e.g., every 60s) to proactively detect topology changes.
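The backoff recommendation above can be sketched as a capped exponential schedule (parameter values are illustrative, not client defaults):

```python
def backoff_schedule(base_ms: int = 100, cap_ms: int = 3000, attempts: int = 5) -> list[int]:
    """Deterministic exponential backoff: 100ms, 200ms, 400ms, ... capped at cap_ms.

    In production, add random jitter to each delay so many clients do not
    retry in lockstep after a shared failure (a retry storm)."""
    return [min(cap_ms, base_ms * (2 ** i)) for i in range(attempts)]
```

Most mature clients (go-redis, Lettuce, Redisson) expose equivalent knobs natively; prefer those over hand-rolled retry loops.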
Observability & Operations
Backup & Security
The platform Backup Center provides convenient data management. You can back up instances, manage backups centrally, and offload them to S3. Historical backups can be restored to specific instances.
See Backup & Restore.
Upgrade & Scaling
Upgrade
See Upgrade.
Scaling Notes
When changing specs (CPU/Mem) or expanding:
- Assess Resources: Ensure cluster has capacity.
- Progressive: Rolling updates to minimize interruption.
- Off-peak: Execute during low traffic.
When reducing replicas or specs, ensure current data/load fits new specs to avoid data loss/crash.
Monitoring
Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus.
Built-in Metrics
Variables {{.namespace}} and {{.name}} should be replaced with actual values.
Key Hit Rate
- Desc: Cache hit rate.
- Unit: %
- Expr:
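For reference, assuming the widely used redis_exporter metric names (redis_keyspace_hits_total and redis_keyspace_misses_total — an assumption; the platform's built-in metric names may differ), a hit-rate expression typically looks like:

```
100 * rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```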
Average Response Time
- Desc: Avg command latency. High = slow queries/bottleneck.
- Unit: s
- Expr:
Role Switching
- Desc: Master-Replica switches in 5m. Non-zero = failover occurred.
- Unit: Count
- Expr:
Instance Status
- Desc: Health status. 0 = Abnormal.
- Expr:
Node Input Bandwidth
- Desc: Peak ingress traffic.
- Unit: Bps
- Expr:
Node Output Bandwidth
- Desc: Peak egress traffic.
- Unit: Bps
- Expr:
Node Connections
- Desc: Peak client connections. Watch if near maxclients.
- Unit: Count
- Expr:
CPU Usage
- Desc: Node CPU usage. Sustained high = perf impact.
- Unit: %
- Expr:
Memory Usage
- Desc: Node memory usage. Above 80% suggests scaling.
- Unit: %
- Expr:
Storage Usage
- Desc: PVC usage. Full = persistence failure.
- Unit: %
- Expr:
Key Metrics & Alert Recommendations
Recommended production alerts:
Troubleshooting
For specific issues, search the Customer Portal.