PMAD — Pool-based Memory Allocator
PMAD is a slab allocator written in C that delivers O(1) allocation and deallocation with no external fragmentation and zero system calls at runtime. Every allocation the daemon makes (task envelopes, wire buffers, worker registration slots) comes out of PMAD. External fragmentation is 0 % by design: every block is pre-sized to a declared class, so blocks are never split or coalesced. The only slack is the round-up from a requested size to its class size, which the choice of size classes controls.
PMAD pre-allocates a contiguous pool of memory with a single `mmap` call at startup, then partitions it into user-defined size classes. Standard allocators (ptmalloc, jemalloc v5.3, tcmalloc v2026) optimise for average-case throughput; PMAD optimises for worst-case determinism and predictable latency budgets.
| Domain | Why PMAD fits |
|---|---|
| Real-time systems | Guaranteed O(1) response — no lock contention, no syscalls at runtime |
| Embedded / RTOS | Minimal footprint, no heap fragmentation, fully configurable memory layout |
| Game engines | Predictable frame-time budgets with zero allocation jitter |
| High-frequency trading | Nanosecond-class allocation latency under sustained throughput |
Architecture
Every allocation is a single lookup-table index followed by a free-list pop. Every deallocation is a free-list push keyed by the block’s own header. Both operations have no conditional branch paths — the fast path is the only path.
- Public API: The thin facade in `incPMAD.h`: `pmad_init`, `pmad_alloc`, `pmad_free`, `pmad_destroy`. This is the entire contract the daemon consumes.
- Size Class Table: A flat array `[MAX_SIZE / ALIGNMENT]` maps a requested byte count directly to the correct size-class descriptor. This is an O(1) table lookup with no branches.
- Free Lists: Each size class owns a singly-linked intrusive free list. A pop is a pointer dereference; a push is a pointer swap. There are no atomics on the fast path: the daemon serialises through its own router, so locks are structurally unnecessary.
- Memory Pool: One `mmap` region split into contiguous runs of blocks, one run per class, sized by the user percentages. Each block carries a 16-byte `BlockHeader` (`next` pointer + class ID), so deallocations need no external metadata.
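The four pieces above can be sketched together in a few dozen lines. This is a minimal illustration, not PMAD's actual source: the `sketch_*` function names, the constants, the three-class layout, and the exact `BlockHeader` fields are assumptions, and a `malloc`'d buffer stands in for the `mmap` region. The sketch also adds defensive checks (size bounds, exhausted class) that the real fast path may handle differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT        16u   /* assumed granularity of the lookup table */
#define MAX_SIZE         1024u /* assumed largest request served          */
#define NUM_CLASSES      3u
#define BLOCKS_PER_CLASS 4u    /* tiny pool, for illustration only        */

/* Assumed layout: next pointer + class ID, 16 bytes on a 64-bit target. */
typedef struct BlockHeader {
    struct BlockHeader *next;
    uint32_t class_id;
    uint32_t pad;
} BlockHeader;
_Static_assert(sizeof(BlockHeader) == 16, "sketch assumes a 64-bit target");

static const size_t class_size[NUM_CLASSES] = {64, 256, 1024};
static BlockHeader *free_list[NUM_CLASSES];
static uint8_t class_of[MAX_SIZE / ALIGNMENT]; /* indexed by (size - 1) / ALIGNMENT */

/* Builds the lookup table and threads every block onto its class free list. */
static int sketch_init(void) {
    uint8_t c = 0;
    for (size_t i = 0; i < MAX_SIZE / ALIGNMENT; i++) {
        size_t largest = (i + 1) * ALIGNMENT; /* biggest request in slot i */
        while (class_size[c] < largest) c++;
        class_of[i] = c;
    }

    size_t pool_bytes = 0;
    for (size_t k = 0; k < NUM_CLASSES; k++)
        pool_bytes += BLOCKS_PER_CLASS * (sizeof(BlockHeader) + class_size[k]);

    uint8_t *cursor = malloc(pool_bytes); /* stand-in for the single mmap region */
    if (cursor == NULL) return -1;

    /* One contiguous run of blocks per class. */
    for (uint32_t k = 0; k < NUM_CLASSES; k++) {
        free_list[k] = NULL;
        for (size_t b = 0; b < BLOCKS_PER_CLASS; b++) {
            BlockHeader *h = (BlockHeader *)cursor;
            h->class_id = k;
            h->next = free_list[k];
            free_list[k] = h;
            cursor += sizeof(BlockHeader) + class_size[k];
        }
    }
    return 0;
}

/* Table lookup, then free-list pop. Returns NULL if the class is exhausted. */
static void *sketch_alloc(size_t size) {
    if (size == 0 || size > MAX_SIZE) return NULL;
    uint8_t c = class_of[(size - 1) / ALIGNMENT];
    BlockHeader *h = free_list[c];
    if (h == NULL) return NULL;
    free_list[c] = h->next;
    return h + 1; /* payload starts right after the header */
}

/* Free-list push keyed by the class ID stored in the block's own header. */
static void sketch_free(void *p) {
    BlockHeader *h = (BlockHeader *)p - 1;
    h->next = free_list[h->class_id];
    free_list[h->class_id] = h;
}
```

Because the free list is LIFO, a block that was just freed is the next one handed out for its class, which tends to keep recently used memory warm in cache.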
Benchmarks
Measured on Apple Silicon, compiled with -O3 -march=native. Full benchmark source and reproduction instructions are available at github.com/anastassow/PMAD.
| Metric | Value |
|---|---|
| P50 allocation latency | 2.59 ns |
| P99.9 allocation latency | 6.50 ns |
| Latency vs block size | Flat — 2.59 ns P50 from 16B to 4096B (O(1) guarantee, demonstrated) |
| Peak throughput | 748.9 Mops/s @ 16B · 690.6 Mops/s @ 64B |
| Worst-case under churn (1024B) | ~40 µs (system allocator: 6.95 ms) |
| Fragmentation | 0 % |
| Runtime syscalls | Zero |
| Correctness | 19/19 tests pass |
Reference configurations
| Profile | Size classes (B) | Split (%) | Suitability |
|---|---|---|---|
| Max throughput | {16} | 100 | Small-object velocity |
| Min overhead | {4096} | 100 | Bulk data density |
| Balanced | {64, 256, 1024} | 60 / 30 / 10 | Mixed workloads |
| Latency-optimised | {32, 128} | 80 / 20 | Critical signalling |
| HFT / network | {32, 128, 512, …} | 60 / 20 / … | L3 packet processing |
| Embedded / RTOS | {8, 16, 32, …} | 30 / 30 / … | Deterministic control |
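To make the split percentages concrete, the sketch below estimates how many blocks each class receives. It rests on assumptions the table does not state: that a percentage is a share of pool bytes (not of block count), that every block costs its class size plus the 16-byte header, and that the pool is 1 MiB. The helper name `blocks_for_class` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_BYTES 16u /* per-block header size given in the Architecture section */

/* Number of blocks with `class_size` payload bytes that fit in `percent` %
 * of a pool of `pool_bytes`. Dividing before multiplying avoids overflow
 * on large pools, at the cost of rounding the run down slightly. */
static size_t blocks_for_class(size_t pool_bytes, unsigned percent, size_t class_size) {
    assert(percent <= 100 && class_size > 0);
    size_t run_bytes = pool_bytes / 100 * percent;
    return run_bytes / (HEADER_BYTES + class_size);
}
```

Under these assumptions, the Balanced profile on a 1 MiB pool works out to 7863 blocks of 64 B, 1156 blocks of 256 B, and 100 blocks of 1024 B. The header is a fixed cost per block, so small classes spend a larger share of their run on headers than large ones.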
Whichever profile you choose, every later `pmad_alloc` is as fast as your first.
Tear-down
A single munmap returns the entire pool to the OS in O(1) — there are no individual blocks to walk, no fragmented regions to compact. Shutdown is symmetric with startup: one syscall in, one syscall out.
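The symmetry between startup and shutdown can be sketched as a pair of helpers around `mmap` and `munmap`. The `Pool` struct and the `pool_open` / `pool_close` names are hypothetical and are not PMAD's `pmad_init` / `pmad_destroy`; the mapping flags are an assumption about how an anonymous pool would typically be requested on a POSIX system.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON /* older BSD-style spelling */
#endif

/* Hypothetical handle: base address and length of the one mapped region. */
typedef struct {
    void  *base;
    size_t bytes;
} Pool;

/* One syscall in: map a private, zero-filled, read/write region. */
static int pool_open(Pool *p, size_t bytes) {
    void *base = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return -1;
    p->base = base;
    p->bytes = bytes;
    return 0;
}

/* One syscall out: the whole region goes back to the OS at once,
 * regardless of how many blocks are still allocated inside it. */
static int pool_close(Pool *p) {
    int rc = munmap(p->base, p->bytes);
    p->base = NULL;
    p->bytes = 0;
    return rc;
}
```

Any pointer handed out from the pool becomes invalid after `pool_close`, so in a design like this the caller has to stop all users of the pool before tearing it down.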