How Ascend vNPU slicing works
Unlike NVIDIA vGPU — where any memory size can be requested — Huawei NPU virtualization is template-based. Each chip model has a fixed set of slice sizes (called templates) that the firmware accepts; the plugin and scheduler always round a memory request up to the smallest template that can hold it.
For the installation steps that set up the Ascend device plugin and the hami-scheduler-device ConfigMap referenced below, see Install for Huawei Ascend NPU.
TOC

- The three numbers per chip model
- What you see on the node
- How a memory request is rounded — trimMemory
- Walk-through: 8 × 910B4, two pods on the same node
- Things to keep in mind

The three numbers per chip model
The vnpus section of the hami-scheduler-device ConfigMap defines, for every chip model, three things the plugin and the scheduler need:
For example, the default entry for Ascend910B4 is:
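The exact defaults ship with the chart; a sketch consistent with the numbers quoted on this page (field names follow HAMi's vnpus schema; chipName, commonWord, memoryCapacity, and the second template shape are illustrative assumptions) looks roughly like:

```yaml
- chipName: 910B4
  commonWord: Ascend910B4
  resourceName: huawei.com/Ascend910B4
  resourceMemoryName: huawei.com/Ascend910B4-memory
  memoryAllocatable: 32768        # MiB the scheduler may hand out per card
  memoryCapacity: 32768           # physical memory per card (32 GiB)
  aiCore: 20                      # compute cores per card
  templates:
    - name: vir05_1c_8g           # smallest slice: 8 GiB, 5 aiCore
      memory: 8192
      aiCore: 5
    - name: vir10_3c_16g          # assumed second shape: 16 GiB, 10 aiCore
      memory: 16384
      aiCore: 10
```

The three things the plugin and scheduler need are all visible here: memoryAllocatable, aiCore, and the templates list that maps memory requests to concrete slice shapes.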
What you see on the node
For every physical card the plugin advertises memoryAllocatable / smallestTemplateMemory "slots" to kubelet. Each slot represents one potential slice on that card; the actual size of the slice is decided later, at scheduling time, by the requested memory.
Worked example — a node with 8 × Ascend 910B4 (32 GiB each) using the default config above:
- Smallest template is vir05_1c_8g (8 GiB), so each card exposes 32768 / 8192 = 4 slots.
- The node's status.allocatable only carries the slot count:
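With 8 cards at 4 slots each, the node status would carry something like the following (the resource name is the one used on this page; the count is just the 8 × 4 arithmetic above):

```yaml
status:
  allocatable:
    huawei.com/Ascend910B4: "32"   # 8 cards x 4 slots each
```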
The companion huawei.com/Ascend910B4-memory is not a kubelet extended resource and does not appear on the node. The plugin instead publishes every card's UUID, total memory, and aiCore in the hami.io/node-register-Ascend910B4 annotation; hami-scheduler reads that annotation to keep a per-card memory budget. A pod consumes one slot from status.allocatable and a chunk of the scheduler-tracked memory budget at the same time.
You can inspect the scheduler-side view with:
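One way to dump the per-card registration annotation mentioned above (the node name is a placeholder; the annotation key is the one quoted on this page, with dots escaped for jsonpath):

```
kubectl get node <node-name> \
  -o jsonpath='{.metadata.annotations.hami\.io/node-register-Ascend910B4}'
```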
How a memory request is rounded — trimMemory
When a pod is admitted, the HAMi webhook walks the templates from smallest to largest and picks the first template whose memory is ≥ the requested memory. That template defines how much memory and how many aiCore the slice actually uses.
The webhook then rewrites the pod's huawei.com/Ascend910B4-memory request to the rounded value, so what you see in the pod spec after admission is the slice size, not the value you asked for.
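The rounding rule can be sketched in a few lines. This is a minimal model, not HAMi's actual code; the 8 GiB template values come from this page, while the 16 GiB shape is an assumed second template for illustration:

```python
# Templates sorted smallest to largest: (name, memory in MiB, aiCore).
TEMPLATES = [
    ("vir05_1c_8g", 8192, 5),
    ("vir10_3c_16g", 16384, 10),  # assumed second shape for illustration
]

def trim_memory(requested_mib: int):
    """Pick the first template whose memory is >= the request."""
    for name, mem, cores in TEMPLATES:
        if mem >= requested_mib:
            return name, mem, cores
    return None  # larger than every template: needs a whole card

print(trim_memory(1024))   # a 1 GiB request rounds up to the 8 GiB template
print(trim_memory(9000))   # just over 8 GiB rounds up to the 16 GiB template
```

After admission, the pod's memory request shows the rounded value (8192, not 1024), which is exactly the rewrite described above.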
Walk-through: 8 × 910B4, two pods on the same node
Initial state — every card is empty: 32 GiB free, 20 aiCore free.
A third pod requesting 1 GiB (rounds to 8 GiB / 5 aiCore) would just fit on card #0 and fill it; a fourth would force the scheduler onto card #1. With gpuSchedulerPolicy=spread, each new slice would instead start on a fresh card.
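The binpack accounting in this walk-through can be checked with a small model. It assumes the first two pods landed on card #0 as an 8 GiB / 5 aiCore slice and a 16 GiB / 10 aiCore slice, which is consistent with the arithmetic above; this is a sketch, not the scheduler's code:

```python
# Each of 8 cards starts empty: 32 GiB (32768 MiB) and 20 aiCore free.
cards = [{"mem": 32768, "cores": 20} for _ in range(8)]

def place(mem: int, cores: int):
    """Binpack: put the slice on the first card that can hold it."""
    for i, card in enumerate(cards):
        if card["mem"] >= mem and card["cores"] >= cores:
            card["mem"] -= mem
            card["cores"] -= cores
            return i
    return None

place(8192, 5)            # assumed pod 1: lands on card 0
place(16384, 10)          # assumed pod 2: also card 0 (binpack)
third = place(8192, 5)    # fills card 0 completely
fourth = place(8192, 5)   # card 0 is full, forced onto card 1
print(third, fourth, cards[0])
```

With gpuSchedulerPolicy=spread the placement loop would instead prefer the emptiest card, so each new slice starts on a fresh one.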
When the slice-bound containers start, the Ascend Docker Runtime sees the env vars the plugin injects:
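A representative pair of variables (names assumed from the Ascend Docker Runtime convention; values are illustrative) might look like:

```
ASCEND_VISIBLE_DEVICES=0        # physical card picked by the scheduler (illustrative)
ASCEND_VNPU_SPECS=vir05_1c_8g   # template chosen by trimMemory (illustrative)
```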
The runtime then asks the firmware to create the actual vNPU. When the pod terminates and the vNPU goes idle, the plugin's periodic cleanup (CleanupIdleVNPUs) destroys it on the next tick so the slot can be reused.
Things to keep in mind
- No fractional templates. Asking for 2 GiB on a chip whose smallest template is 8 GiB still consumes 8 GiB and 5 aiCore. Pick a chip / template set that matches your typical workload.
- Slicing is single-card only. huawei.com/Ascend910B4 > 1 is interpreted as "I want N whole cards" — the webhook rejects requests that combine a count > 1 with a memory request smaller than memoryAllocatable.
- The ConfigMap is the source of truth. If your cards have a non-default memory size, or you need different template shapes, edit the vnpus entry for that chip in hami-scheduler-device and restart the hami-ascend-device-plugin pods. Changes are lost on the next chart upgrade.