Paddle Autogrow Memory Allocation Crash on GPU-Manager
Problem Description
Symptoms
When both PaddlePaddle's Autogrow memory allocation strategy and GPU-Manager's virtualized memory management are enabled simultaneously, the following anomalies may occur:
- OOM errors caused by non-contiguous memory allocation
- Abnormal GPU utilization fluctuations
- Random training process crashes
- Inconsistent memory usage between nvidia-smi reports and framework statistics
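To spot the last symptom quickly, it helps to compare the framework's own counters with the driver's view. The snippet below is a minimal sketch, assuming Paddle 2.x (where `paddle.device.cuda.memory_allocated` / `memory_reserved` are available) and `nvidia-smi` on the PATH:

```python
import subprocess

import paddle

# Framework view: bytes tracked by Paddle's allocator on GPU 0.
allocated = paddle.device.cuda.memory_allocated(0)
reserved = paddle.device.cuda.memory_reserved(0)

# Driver view: memory used on GPU 0 as reported by nvidia-smi (MiB).
smi_used = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used",
     "--format=csv,noheader,nounits", "-i", "0"],
    text=True,
).strip()

print(f"Paddle allocated: {allocated / 1024**2:.1f} MiB")
print(f"Paddle reserved:  {reserved / 1024**2:.1f} MiB")
print(f"nvidia-smi used:  {smi_used} MiB")
# A large, growing gap between the reserved figure and the nvidia-smi
# figure is the inconsistency described above.
```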
Root Cause
Root Cause Analysis
- Memory Allocation Strategy Conflict: Paddle's Autogrow uses dynamic, segmented allocation, while GPU-Manager's virtualization requires contiguous physical memory mapping
- Management Mechanism Incompatibility: Autogrow's delayed-release mechanism conflicts with GPU-Manager's memory reclamation strategy
- Metadata Maintenance Conflict: each system maintains its own allocation metadata, which leads to inconsistent memory views
Trigger Mechanism:
- Autogrow attempts to pick an optimal block size during allocation
- GPU-Manager's virtualization layer intercepts the underlying physical memory requests
- Non-contiguous allocations cause virtual address mapping failures
- Dual management leads to inconsistent allocator metadata between the two layers
Solution
Solution Overview
Force Paddle to use the traditional allocation strategy via an environment variable, as shown below:
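In current Paddle releases the relevant flag is `FLAGS_allocator_strategy`; switching it from `auto_growth` (Autogrow) back to `naive_best_fit` selects the traditional allocator:

```bash
# Select Paddle's traditional naive best-fit allocator instead of auto_growth.
export FLAGS_allocator_strategy=naive_best_fit
```

Because the flag is read when Paddle initializes its allocator, it must be present in the process environment before the framework starts, which is why the restart noted below is required.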
Considerations
- Requires training process restart
- May reduce Paddle's memory reuse efficiency
Implementation Steps
Kubernetes Deployment
- Edit the Deployment configuration to add the environment variable
- Apply the updated configuration
- Verify that the variable is present in the running pod (see the example after this list)
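A sketch of the three steps, assuming a Deployment named `paddle-training` with a container named `trainer` in the `training` namespace; all names and the image are hypothetical and should be replaced with your own. First, the relevant fragment of the edited manifest:

```yaml
# Fragment of the Deployment manifest: add the allocator flag to the
# training container's environment.
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: registry.example.com/paddle-train:latest  # placeholder
          env:
            - name: FLAGS_allocator_strategy
              value: "naive_best_fit"
```

Then apply the full, edited manifest and confirm the variable inside the rolled-out pod:

```bash
# Apply the edited manifest, wait for the rollout, and check the pod env.
kubectl apply -f deployment.yaml -n training
kubectl rollout status deployment/paddle-training -n training
kubectl exec -n training deploy/paddle-training -- env | grep FLAGS_allocator_strategy
```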
Bare Metal Deployment
- Set the environment variable in the shell before launching training
- Or set it in Python code before importing paddle (see the examples below)
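A sketch of both options; the script name `train.py` is illustrative. In either case the flag must be in place before `paddle` is imported, since the allocator is chosen when the framework initializes.

```bash
# Option 1: export the flag in the shell, then launch training.
export FLAGS_allocator_strategy=naive_best_fit
python train.py
```

```python
# Option 2: set the flag from Python before importing paddle.
import os

os.environ["FLAGS_allocator_strategy"] = "naive_best_fit"

import paddle  # imported after the flag is set, on purpose

# Optional sanity check (paddle.get_flags is available in Paddle 2.x).
print(paddle.get_flags("FLAGS_allocator_strategy"))
```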
Verification Methods
- Confirm in the logs that the allocation strategy switch took effect
- Monitor memory allocation continuity
- Run a stress test to validate stability (see the sketch below)
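As an example of the first and third checks; the log file name, the reliance on allocator-related log lines, and the test sizes are assumptions rather than part of the original runbook.

```bash
# Look for allocator-related lines in the training log and double-check
# that the environment variable reached the process.
grep -i "allocator" train.log
env | grep FLAGS_allocator_strategy
```

```python
# Simple stress loop: allocate and free tensors of varying sizes while
# watching the high-water mark of reserved GPU memory.
import paddle

paddle.set_device("gpu:0")
for step in range(200):
    size_mb = 64 + (step % 8) * 64      # vary allocation size
    n = size_mb * 1024 * 1024 // 4      # number of float32 elements
    x = paddle.rand([n])
    y = x * 2.0                         # force a kernel launch
    del x, y
    if step % 50 == 0:
        peak = paddle.device.cuda.max_memory_reserved(0)
        print(f"step {step}: peak reserved {peak / 1024**2:.0f} MiB")
```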
Preventive Measures
- Version Compatibility Check: review Paddle release notes for memory allocation changes before upgrading
- Monitoring Configuration: add a Prometheus alert rule for GPU memory pressure (see the sketch after this list)
- Baseline Testing: perform memory allocation baseline tests in new environments (see the sketch after this list)
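A sketch of such an alert rule, assuming GPU metrics come from NVIDIA's dcgm-exporter (`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE`, in MiB); the metric names, labels, and threshold are assumptions to adapt to your setup:

```yaml
groups:
  - name: gpu-memory
    rules:
      - alert: GpuFramebufferNearlyFull
        # Fires when less than ~5% of GPU framebuffer memory is free for 5 minutes.
        expr: |
          DCGM_FI_DEV_FB_FREE / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) < 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is nearly out of memory"
```

And a minimal baseline allocation test to run when bringing up a new environment; the sizes, the doubling schedule, and the exception types caught are illustrative only:

```python
# Allocate progressively larger float32 tensors on GPU 0 and record how
# large an allocation succeeds, as a simple environment baseline.
import paddle

paddle.set_device("gpu:0")
successful_mb = []
size_mb = 128
while size_mb <= 8192:
    try:
        x = paddle.zeros([size_mb * 1024 * 1024 // 4])
    except (RuntimeError, MemoryError) as err:  # OOM error type varies by version
        print(f"Allocation of {size_mb} MiB failed: {err}")
        break
    successful_mb.append(size_mb)
    del x
    size_mb *= 2

print("Successful allocation sizes (MiB):", successful_mb)
```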