Paddle Autogrow Memory Allocation Crash on GPU-Manager

Problem Description

Symptoms

When both PaddlePaddle's Autogrow memory allocation strategy and GPU-Manager's virtualized memory management are enabled simultaneously, the following anomalies may occur:

  1. OOM errors caused by non-contiguous memory allocation
  2. Abnormal GPU utilization fluctuations
  3. Random training process crashes
  4. Memory usage reported by nvidia-smi is inconsistent with the framework's own statistics (a quick check is sketched after this list)
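
A quick way to check the fourth symptom is to compare the driver's view of GPU memory with Paddle's own counters. The sketch below is illustrative only: it assumes Paddle 2.x (paddle.device.cuda.memory_reserved) with CUDA support and a working nvidia-smi on PATH, and the helper name nvidia_smi_used_mib is made up for this example.

    # Sketch: compare driver-reported memory with Paddle's own statistics (GPU 0)
    import subprocess
    import paddle

    def nvidia_smi_used_mib(gpu_index=0):
        """Query memory.used (MiB) for one GPU via nvidia-smi."""
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits", "-i", str(gpu_index)],
            text=True,
        )
        return int(out.strip())

    x = paddle.randn([1024, 1024, 64], dtype='float32')  # force allocator initialization

    framework_mib = paddle.device.cuda.memory_reserved(0) / (1024 ** 2)
    driver_mib = nvidia_smi_used_mib(0)
    print(f"Paddle reserved: {framework_mib:.0f} MiB, nvidia-smi used: {driver_mib} MiB")
    # A large and growing gap between these two numbers matches symptom 4.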

Root Cause

Root Cause Analysis

  1. Memory Allocation Strategy Conflict: Paddle's Autogrow uses dynamic, segmented allocation, while GPU-Manager's virtualization layer requires contiguous physical memory mappings

  2. Management Mechanism Incompatibility: Autogrow's delayed-release mechanism conflicts with GPU-Manager's memory reclamation strategy; this chunk-retention behaviour can be observed directly, as sketched after the trigger list below

  3. Metadata Maintenance Conflict: both systems maintain their own allocation metadata, which leads to inconsistent views of GPU memory

Trigger Mechanism:

  • Autogrow attempts optimal block sizing during allocation
  • GPU-Manager virtualization layer intercepts physical memory requests
  • Non-contiguous allocations cause virtual address mapping failures
  • Dual management leads to metadata consistency exceptions
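
The delayed-release behaviour behind the second point can be observed from inside the framework: under Autogrow, memory freed at the Python level is typically kept in Paddle's chunk pool rather than returned to the driver. A minimal sketch, assuming Paddle 2.x with CUDA and using paddle.device.cuda.memory_allocated / memory_reserved:

    # Sketch: observe Autogrow's chunked growth and delayed release on GPU 0
    import paddle

    def report(tag):
        allocated = paddle.device.cuda.memory_allocated(0) / (1024 ** 2)
        reserved = paddle.device.cuda.memory_reserved(0) / (1024 ** 2)
        print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

    tensors = [paddle.randn([1024, 1024, 64], dtype='float32') for _ in range(4)]  # ~256 MiB each
    report("after allocation")

    del tensors               # release the tensors at the Python level
    report("after release")   # under Autogrow, 'reserved' usually stays high: chunks are retained for reuse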

Solution

Solution Overview

Force Paddle to use the traditional allocation strategy by setting an environment variable:

FLAGS_allocator_strategy=naive_best_fit

Considerations

  1. Requires a training process restart; after restarting, confirm that the flag took effect (see the sketch after this list)
  2. May reduce Paddle's memory reuse efficiency
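
After restarting the process with the flag in place, it is worth confirming that Paddle actually picked it up. A minimal check, assuming Paddle 2.x where paddle.get_flags is available:

    # Sketch: confirm the active allocator strategy after restart
    import paddle

    strategy = paddle.get_flags(['FLAGS_allocator_strategy'])['FLAGS_allocator_strategy']
    print("Active allocator strategy:", strategy)
    assert strategy == 'naive_best_fit', "Flag not applied; check how the process was launched"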

Implementation Steps

Kubernetes Deployment

  1. Edit Deployment configuration

    apiVersion: apps/v1
    kind: Deployment
    spec:
      template:
        spec:
          containers:
            - name: paddle-container
              env:
                - name: FLAGS_allocator_strategy
                  value: 'naive_best_fit'
  2. Apply configuration

    kubectl apply -f updated_deployment.yaml
  3. Verify configuration

    kubectl exec <pod-name> -- env | grep FLAGS

Bare Metal Deployment

  1. Set environment variable before execution

    export FLAGS_allocator_strategy=naive_best_fit
    python train.py
  2. Or set it in Python code, before paddle is imported (a defensive entry-point pattern is sketched after this list)

    import os
    os.environ['FLAGS_allocator_strategy'] = 'naive_best_fit'  # must run before `import paddle`
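
Because environment-variable flags are read when Paddle initializes, the variable generally has to be in place before `import paddle` runs. A defensive entry-point pattern (a sketch only; the layout of train.py is hypothetical):

    # Sketch: set the allocator strategy defensively before Paddle is imported
    import os

    os.environ.setdefault('FLAGS_allocator_strategy', 'naive_best_fit')

    import paddle  # imported only after the flag is in the environment

    def main():
        paddle.utils.run_check()  # basic sanity check that the device is usable
        # ... build the model, data loaders, and training loop here ...

    if __name__ == '__main__':
        main()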

Verification Methods

  1. Confirm the allocation strategy in the framework startup logs

    I0715 14:25:17.112233 12345 allocator.cc:256] Using Naive Best Fit allocation strategy
  2. Monitor memory allocation continuity

    nvidia-smi --query-gpu=memory.used --format=csv -l 1
  3. Stress test validation

    # Continuous allocation test: keep references so memory actually accumulates
    import paddle
    tensors = []
    for i in range(10):
        # 1024 * 1024 * 256 float32 values = 1 GiB per tensor
        tensors.append(paddle.randn([1024, 1024, 256], dtype='float32'))
        print(f"Allocated {i + 1} GiB")

Preventive Measures

  1. Version Compatibility Check: review Paddle release notes for memory allocation changes during upgrades

  2. Monitoring Configuration: add a Prometheus alert rule, for example:

    - alert: GPUAllocConflict
      expr: rate(paddle_gpu_malloc_failed_total[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: "GPU Memory Allocation Conflict Alert"
  3. Baseline Testing: perform memory allocation baseline tests for new environments, and consider recording the results as sketched after this list:

    python -c "import paddle; paddle.utils.run_check()"
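
Beyond run_check, a small script can record how much memory a fixed workload reserves in a given environment, so that regressions after a Paddle or GPU-Manager upgrade are easy to spot. A sketch, assuming Paddle 2.x; the workload size and the file name baseline_memory.json are arbitrary choices for this example:

    # Sketch: record a memory-allocation baseline for a fixed workload on GPU 0
    import json
    import paddle

    tensors = [paddle.randn([1024, 1024, 64], dtype='float32') for _ in range(8)]  # ~2 GiB total

    baseline = {
        "allocator_strategy": paddle.get_flags(['FLAGS_allocator_strategy'])['FLAGS_allocator_strategy'],
        "allocated_mib": paddle.device.cuda.memory_allocated(0) / (1024 ** 2),
        "reserved_mib": paddle.device.cuda.memory_reserved(0) / (1024 ** 2),
    }

    with open("baseline_memory.json", "w") as f:
        json.dump(baseline, f, indent=2)
    print(baseline)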

Memory Allocation Strategy Comparison

Strategy        Advantages              Disadvantages
autogrow        High reuse efficiency   Poor large-block performance
naive_best_fit  Stable allocation       Potential fragmentation
