How Ascend vNPU slicing works
Unlike NVIDIA vGPU — where any memory size can be requested — Huawei NPU virtualization is template-based. Each chip model has a fixed set of slice sizes (called templates) that the firmware accepts; the plugin and scheduler always round a memory request up to the smallest template that can hold it.
For the installation steps that set up the Ascend device plugin and the hami-scheduler-device ConfigMap referenced below, see Install for Huawei Ascend NPU.
TOC

- The three numbers per chip model
- What you see on the node
- How a memory request is rounded — trimMemory
- Walk-through: 8 × 910B4, two pods on the same node
- Things to keep in mind

The three numbers per chip model
The vnpus section of the hami-scheduler-device ConfigMap defines, for every chip model, three things the plugin and the scheduler need:
For example, the default entry for Ascend910B4 is:
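The exact defaults ship with the chart; a sketch consistent with the numbers quoted on this page (field names follow HAMi's vnpus schema; chipName, commonWord, memoryCapacity, and the second template shape are illustrative assumptions) looks roughly like:

```yaml
- chipName: 910B4
  commonWord: Ascend910B4
  resourceName: huawei.com/Ascend910B4
  resourceMemoryName: huawei.com/Ascend910B4-memory
  memoryAllocatable: 32768        # MiB the scheduler may hand out per card
  memoryCapacity: 32768           # physical memory per card (32 GiB)
  aiCore: 20                      # compute cores per card
  templates:
    - name: vir05_1c_8g           # smallest slice: 8 GiB, 5 aiCore
      memory: 8192
      aiCore: 5
    - name: vir10_3c_16g          # assumed second shape: 16 GiB, 10 aiCore
      memory: 16384
      aiCore: 10
```

The three things the plugin and scheduler need are all visible here: memoryAllocatable, aiCore, and the templates list that maps memory requests to concrete slice shapes.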
What you see on the node
For every physical card the plugin advertises memoryAllocatable / smallestTemplateMemory "slots" to kubelet. Each slot represents one potential slice on that card; the actual size of the slice is decided later, at scheduling time, by the requested memory.
Worked example — a node with 8 × Ascend 910B4 (32 GiB each) using the default config above:
- Smallest template is vir05_1c_8g (8 GiB), so each card exposes 32768 / 8192 = 4 slots.
- The node's status.allocatable only carries the slot count:
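With 8 cards at 4 slots each, the node status would carry something like the following (the resource name is the one used on this page; the count is just the 8 × 4 arithmetic above):

```yaml
status:
  allocatable:
    huawei.com/Ascend910B4: "32"   # 8 cards x 4 slots each
```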
The companion huawei.com/Ascend910B4-memory is not a kubelet extended resource and does not appear on the node. The plugin instead publishes every card's UUID, total memory, and aiCore in the hami.io/node-register-Ascend910B4 annotation; hami-scheduler reads that annotation to keep a per-card memory budget. A pod consumes one slot from status.allocatable and a chunk of the scheduler-tracked memory budget at the same time.
You can inspect the scheduler-side view with:
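One way to dump the per-card registration annotation mentioned above (the node name is a placeholder; the annotation key is the one quoted on this page, with dots escaped for jsonpath):

```
kubectl get node <node-name> \
  -o jsonpath='{.metadata.annotations.hami\.io/node-register-Ascend910B4}'
```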
How a memory request is rounded — trimMemory
When a pod is admitted, the HAMi webhook walks the templates from smallest to largest and picks the first template whose memory is ≥ the requested memory. That template defines how much memory and how many aiCore the slice actually uses.
The webhook then rewrites the pod's huawei.com/Ascend910B4-memory request to the rounded value, so what you see in the pod spec after admission is the slice size, not the value you asked for.
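The rounding rule can be sketched in a few lines. This is a minimal model, not HAMi's actual code; the 8 GiB template values come from this page, while the 16 GiB shape is an assumed second template for illustration:

```python
# Templates sorted smallest to largest: (name, memory in MiB, aiCore).
TEMPLATES = [
    ("vir05_1c_8g", 8192, 5),
    ("vir10_3c_16g", 16384, 10),  # assumed second shape for illustration
]

def trim_memory(requested_mib: int):
    """Pick the first template whose memory is >= the request."""
    for name, mem, cores in TEMPLATES:
        if mem >= requested_mib:
            return name, mem, cores
    return None  # larger than every template: needs a whole card

print(trim_memory(1024))   # a 1 GiB request rounds up to the 8 GiB template
print(trim_memory(9000))   # just over 8 GiB rounds up to the 16 GiB template
```

After admission, the pod's memory request shows the rounded value (8192, not 1024), which is exactly the rewrite described above.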
Walk-through: 8 × 910B4, two pods on the same node
Initial state — every card is empty: 32 GiB free, 20 aiCore free.
A third pod requesting 1 GiB (rounds to 8 GiB / 5 aiCore) would just fit on card #0 and fill it; a fourth would force the scheduler onto card #1. With gpuSchedulerPolicy=spread, each new slice would instead start on a fresh card.
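The binpack accounting in this walk-through can be checked with a small model. It assumes the first two pods landed on card #0 as an 8 GiB / 5 aiCore slice and a 16 GiB / 10 aiCore slice, which is consistent with the arithmetic above; this is a sketch, not the scheduler's code:

```python
# Each of 8 cards starts empty: 32 GiB (32768 MiB) and 20 aiCore free.
cards = [{"mem": 32768, "cores": 20} for _ in range(8)]

def place(mem: int, cores: int):
    """Binpack: put the slice on the first card that can hold it."""
    for i, card in enumerate(cards):
        if card["mem"] >= mem and card["cores"] >= cores:
            card["mem"] -= mem
            card["cores"] -= cores
            return i
    return None

place(8192, 5)            # assumed pod 1: lands on card 0
place(16384, 10)          # assumed pod 2: also card 0 (binpack)
third = place(8192, 5)    # fills card 0 completely
fourth = place(8192, 5)   # card 0 is full, forced onto card 1
print(third, fourth, cards[0])
```

With gpuSchedulerPolicy=spread the placement loop would instead prefer the emptiest card, so each new slice starts on a fresh one.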
When the slice-bound containers start, the Ascend Docker Runtime sees the env vars the plugin injects:
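A representative pair of variables (names assumed from the Ascend Docker Runtime convention; values are illustrative) might look like:

```
ASCEND_VISIBLE_DEVICES=0        # physical card picked by the scheduler (illustrative)
ASCEND_VNPU_SPECS=vir05_1c_8g   # template chosen by trimMemory (illustrative)
```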
The runtime then asks the firmware to create the actual vNPU. When the pod terminates and the vNPU goes idle, the plugin's periodic cleanup (CleanupIdleVNPUs) destroys it on the next tick so the slot can be reused.
Things to keep in mind
- No fractional templates. Asking for 2 GiB on a chip whose smallest template is 8 GiB still consumes 8 GiB and 5 aiCore. Pick a chip / template set that matches your typical workload.
- Slicing is single-card only. huawei.com/Ascend910B4 > 1 is interpreted as "I want N whole cards" — the webhook rejects requests that combine a count > 1 with a memory request smaller than memoryAllocatable.
- The ConfigMap is the source of truth. If your cards have a non-default memory size, or you need different template shapes, edit the vnpus entry for that chip in hami-scheduler-device and restart the hami-ascend-device-plugin pods. Changes are lost on the next chart upgrade.