Paddle Autogrow Memory Allocation Crash on GPU-Manager

Problem Description

Symptoms

When both PaddlePaddle's Autogrow memory allocation strategy and GPU-Manager's virtualized memory management are enabled simultaneously, the following anomalies may occur:

  1. OOM errors due to discontinuous memory allocation
  2. Abnormal GPU utilization fluctuations
  3. Random training process crashes
  4. Inconsistent memory usage between nvidia-smi reports and framework statistics (see the comparison sketch after this list)
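
To confirm the fourth symptom, put the framework's statistics next to the driver's. The snippet below is a minimal sketch, assuming Paddle 2.2 or later (which provides paddle.device.cuda.memory_allocated / memory_reserved), a single GPU at index 0, and nvidia-smi on the PATH:

import subprocess

import paddle

paddle.set_device('gpu:0')
x = paddle.randn([4096, 4096], dtype='float32')  # ~64 MB placeholder workload

# Framework-side view (bytes)
allocated = paddle.device.cuda.memory_allocated()
reserved = paddle.device.cuda.memory_reserved()

# Driver-side view via nvidia-smi (MiB); includes CUDA context overhead
used_mib = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.used',
     '--format=csv,noheader,nounits', '-i', '0'],
    text=True).strip()

print(f'paddle allocated: {allocated / 2**20:.1f} MiB')
print(f'paddle reserved : {reserved / 2**20:.1f} MiB')
print(f'nvidia-smi used : {used_mib} MiB')
# The two views never match exactly (the CUDA context is invisible to the
# framework), but a gap that keeps growing during training is the
# inconsistency listed as symptom 4.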

Root Cause

Root Cause Analysis

  1. Memory Allocation Strategy Conflict: Paddle's Autogrow uses dynamic, segmented allocation, while GPU-Manager's virtualization layer requires contiguous physical memory mapping.

  2. Management Mechanism Incompatibility: Autogrow's delayed-release mechanism conflicts with GPU-Manager's memory reclamation strategy (see the sketch after the trigger list below).

  3. Metadata Maintenance Conflict: Both systems maintain their own allocation metadata, which leaves them with inconsistent views of device memory.

Trigger Mechanism:

  • Autogrow attempts optimal block sizing during allocation
  • GPU-Manager virtualization layer intercepts physical memory requests
  • Non-contiguous allocations cause virtual address mapping failures
  • Dual management triggers metadata-consistency errors
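
The delayed-release behavior behind root cause 2 can be observed from the framework's own statistics: under Autogrow, freeing a tensor returns its block to Paddle's internal cache rather than to the driver. A minimal sketch, again assuming Paddle 2.2 or later:

import paddle

paddle.set_device('gpu')

x = paddle.randn([1024, 1024, 64], dtype='float32')   # ~256 MB
print('after alloc, reserved MiB:',
      paddle.device.cuda.memory_reserved() / 2**20)

del x
paddle.device.cuda.synchronize()
# Under autogrow the freed block stays in Paddle's cache for reuse, so the
# reserved figure typically does not drop here. This cached but unused
# memory is what GPU-Manager's reclamation strategy collides with.
print('after free,  reserved MiB:',
      paddle.device.cuda.memory_reserved() / 2**20)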

Solution

Solution Overview

Force Paddle to use the traditional allocation strategy via an environment variable:

FLAGS_allocator_strategy=naive_best_fit

Considerations

  1. Requires restarting the training process
  2. May reduce Paddle's memory reuse efficiency (see the comparison sketch below)
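
Because FLAGS_* settings are only read at process start-up, the easiest way to gauge the efficiency cost is to launch the same workload twice with different strategies and compare peak usage. The sketch below is illustrative only: the inline workload, the 'auto_growth' value (the name of the default strategy in recent Paddle releases), and the use of max_memory_reserved (Paddle 2.2+) are assumptions; substitute the real training script for the -c snippet.

import os
import subprocess

# Stand-in workload; replace with the real training entry point.
WORKLOAD = (
    "import paddle; "
    "xs = [paddle.randn([1024, 1024, 64]) for _ in range(4)]; "
    "print('peak reserved MiB:', paddle.device.cuda.max_memory_reserved() / 2**20)"
)

for strategy in ('auto_growth', 'naive_best_fit'):
    env = dict(os.environ, FLAGS_allocator_strategy=strategy)
    result = subprocess.run(['python', '-c', WORKLOAD],
                            env=env, capture_output=True, text=True, check=True)
    print(strategy, '->', result.stdout.strip())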

Implementation Steps

Kubernetes Deployment

  1. Edit Deployment configuration
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: paddle-container
        env:
        - name: FLAGS_allocator_strategy
          value: "naive_best_fit"
  2. Apply the configuration
kubectl apply -f updated_deployment.yaml
  3. Verify the configuration
kubectl exec <pod-name> -- env | grep FLAGS

Bare Metal Deployment

  1. Set environment variable before execution
export FLAGS_allocator_strategy=naive_best_fit
python train.py
  2. Or set it in Python code before importing paddle (a verification sketch follows)
import os
os.environ['FLAGS_allocator_strategy'] = 'naive_best_fit'
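
In either case, the setting can be confirmed from inside the process. A small sketch, assuming paddle.get_flags is available (Paddle 2.x); note that the environment variable must be set before paddle is imported, otherwise the framework initializes with the default strategy:

import os
os.environ['FLAGS_allocator_strategy'] = 'naive_best_fit'  # must precede the import

import paddle

# get_flags returns a dict mapping flag names to their current values
flags = paddle.get_flags(['FLAGS_allocator_strategy'])
print('allocator strategy:', flags['FLAGS_allocator_strategy'])
assert flags['FLAGS_allocator_strategy'] == 'naive_best_fit'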

Verification Methods

  1. Check the log for the allocation-strategy confirmation line
I0715 14:25:17.112233 12345 allocator.cc:256] Using Naive Best Fit allocation strategy
  2. Monitor memory allocation continuity
nvidia-smi --query-gpu=memory.used --format=csv -l 1
  3. Stress test validation
# Continuous allocation test: hold references so usage actually accumulates
import paddle
held = []
for i in range(10):
    held.append(paddle.randn([1024, 1024, 256], dtype='float32'))  # ~1 GiB per tensor
    print(f"Allocated {i + 1} GiB")

Preventive Measures

  1. Version Compatibility Check: Review Paddle release notes for memory allocation changes during upgrades

  2. Monitoring Configuration: Add a Prometheus alert rule (an exporter sketch follows this list):

    - alert: GPUAllocConflict
      expr: rate(paddle_gpu_malloc_failed_total[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: "GPU Memory Allocation Conflict Alert"
    
  3. Baseline Testing: Perform a memory allocation baseline test in new environments:

    python -c "import paddle; paddle.utils.run_check()"
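
Note that paddle_gpu_malloc_failed_total in the rule above is not a metric Paddle exports on its own; it is assumed to come from a small hook around the training loop. A rough sketch using the prometheus_client package, where the metric name, the port, and the exception types caught are all assumptions for illustration:

import time

import paddle
from prometheus_client import Counter, start_http_server

# Hypothetical counter matching the alert rule above.
MALLOC_FAILED = Counter(
    'paddle_gpu_malloc_failed_total',
    'GPU allocations that failed with an out-of-memory style error')

def guarded_alloc(shape):
    """Allocate a tensor, counting OOM-style failures instead of crashing."""
    try:
        return paddle.randn(shape, dtype='float32')
    except (MemoryError, RuntimeError):  # exact exception type depends on the Paddle version
        MALLOC_FAILED.inc()
        return None

if __name__ == '__main__':
    start_http_server(9109)              # scrape port is arbitrary
    while True:
        guarded_alloc([1024, 1024])      # stand-in for real training-step allocations
        time.sleep(10)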
    

Memory Allocation Strategy Comparison

Strategy          Advantages           Disadvantages
autogrow          High efficiency      Poor large-block performance
naive_best_fit    Stable allocation    Potential fragmentation
