§ 02.02 Core Concepts

PMAD — Pool-based Memory Allocator

A slab allocator written in C that delivers O(1) allocation and deallocation with no external fragmentation and no system calls at runtime. Every allocation the daemon makes (task envelopes, wire buffers, worker registration slots) comes out of PMAD. External fragmentation is 0 % by design: every block is pre-sized to a declared class, so there is no splitting and no coalescing. The only slack is the rounding of a request up to its class size, which the choice of size classes controls.

PMAD pre-allocates a contiguous pool of memory with a single mmap call at startup, then partitions it into user-defined size classes. Standard allocators (ptmalloc, jemalloc v5.3, tcmalloc v2026) optimise for average-case throughput — PMAD optimises for worst-case determinism and predictable latency budgets.

Domain	Why PMAD fits
Real-time systems	Guaranteed O(1) response — no lock contention, no syscalls at runtime
Embedded / RTOS	Minimal footprint, no heap fragmentation, fully configurable memory layout
Game engines	Predictable frame-time budgets with zero allocation jitter
High-frequency trading	Nanosecond-class allocation latency under sustained throughput

Architecture

Every allocation is a single lookup-table index followed by a free-list pop. Every deallocation is a free-list push keyed by the block’s own header. Both operations have no conditional branch paths — the fast path is the only path.

Public API: The thin facade in incPMAD.h — pmad_init, pmad_alloc, pmad_free, pmad_destroy. This is the entire contract the daemon consumes.
Size Class Table: A flat array [MAX_SIZE / ALIGNMENT] maps a requested byte-count directly to the correct size-class descriptor — an O(1) table lookup, no branches.
Free Lists: Each size class owns a singly-linked intrusive free list. A pop is a pointer dereference; a push is a pointer swap. No atomics on the fast path — the daemon serialises through its own router, so locks are structurally unnecessary.
Memory Pool: One mmap region split into contiguous runs of blocks, one run per class, sized by the user percentages. Each block carries a 16-byte BlockHeader (next pointer + class ID) so deallocations need no external metadata.
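The four pieces above can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not PMAD's source: every `demo_` name, the three classes (64, 128, 256 bytes), and the four-blocks-per-class layout are hypothetical, and `malloc` stands in for the `mmap` region to keep the example short. The sketch also adds explicit range and exhaustion checks that the real fast path may handle differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ALIGNMENT        16u
#define DEMO_MAX_SIZE         256u
#define DEMO_SLOTS            (DEMO_MAX_SIZE / DEMO_ALIGNMENT)
#define DEMO_NUM_CLASSES      3u
#define DEMO_BLOCKS_PER_CLASS 4u

/* Per-block header: 16 bytes on typical 64-bit targets
   (8-byte pointer + 4-byte class id + padding). */
typedef struct demo_header {
    struct demo_header *next;   /* next free block in the same class */
    uint32_t class_id;          /* index into demo_heads */
} demo_header;

static const size_t demo_class_size[DEMO_NUM_CLASSES] = { 64, 128, 256 };
static uint8_t      demo_lookup[DEMO_SLOTS];      /* size slot -> class id */
static demo_header *demo_heads[DEMO_NUM_CLASSES]; /* one free list per class */
static void        *demo_base;                    /* backing storage */

static int demo_init(void) {
    size_t total = 0;
    for (uint32_t k = 0; k < DEMO_NUM_CLASSES; k++) {
        total += DEMO_BLOCKS_PER_CLASS * (sizeof(demo_header) + demo_class_size[k]);
    }
    unsigned char *cur = malloc(total);   /* stand-in for the mmap pool */
    if (cur == NULL) {
        return -1;
    }
    demo_base = cur;

    /* Slot i covers requests of (i*16 + 1) .. (i+1)*16 bytes. */
    uint32_t c = 0;
    for (size_t i = 0; i < DEMO_SLOTS; i++) {
        while (demo_class_size[c] < (i + 1) * DEMO_ALIGNMENT) {
            c++;
        }
        demo_lookup[i] = (uint8_t)c;
    }

    /* Carve one contiguous run per class and thread it onto its free list. */
    for (uint32_t k = 0; k < DEMO_NUM_CLASSES; k++) {
        demo_heads[k] = NULL;
        for (size_t b = 0; b < DEMO_BLOCKS_PER_CLASS; b++) {
            demo_header *h = (demo_header *)cur;
            h->class_id = k;
            h->next = demo_heads[k];
            demo_heads[k] = h;
            cur += sizeof(demo_header) + demo_class_size[k];
        }
    }
    return 0;
}

/* Table lookup, then free-list pop. Returns NULL for size 0, oversized
   requests, or an exhausted class. */
static void *demo_alloc(size_t size) {
    size_t slot = (size - 1) / DEMO_ALIGNMENT;   /* size 0 wraps and is rejected below */
    if (slot >= DEMO_SLOTS) {
        return NULL;
    }
    demo_header **head = &demo_heads[demo_lookup[slot]];
    demo_header *h = *head;
    if (h == NULL) {
        return NULL;
    }
    *head = h->next;
    return h + 1;   /* payload starts right after the header */
}

/* Free-list push keyed by the class id stored in the block's own header. */
static void demo_release(void *ptr) {
    demo_header *h = (demo_header *)ptr - 1;
    h->next = demo_heads[h->class_id];
    demo_heads[h->class_id] = h;
}

static void demo_destroy(void) {
    free(demo_base);
    demo_base = NULL;
}
```

Because the free list is last-in, first-out, a block released and immediately re-requested in the same class comes back at the same address, which tends to keep recently used memory warm in cache.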

Benchmarks

Measured on Apple Silicon (-O3 -march=native). Full benchmark source and reproduction instructions are available on github.com/anastassow/PMAD.

Metric	Value
P50 allocation latency	2.59 ns
P99.9 allocation latency	6.50 ns
Latency vs block size	Flat — 2.59 ns P50 from 16B to 4096B (O(1) guarantee, demonstrated)
Peak throughput	748.9 Mops/s @ 16B · 690.6 Mops/s @ 64B
Worst-case under churn (1024B)	~40 µs (system allocator: 6.95 ms)
Fragmentation	0 %
Runtime syscalls	Zero
Correctness	19/19 tests pass

Reference configurations

Profile	Size classes (B)	Split (%)	Suitability
Max throughput	`{16}`	100	Small-object velocity
Min overhead	`{4096}`	100	Bulk data density
Balanced	`{64, 256, 1024}`	60 / 30 / 10	Mixed workloads
Latency-optimised	`{32, 128}`	80 / 20	Critical signalling
HFT / network	`{32, 128, 512, …}`	60 / 20 / …	L3 packet processing
Embedded / RTOS	`{8, 16, 32, …}`	30 / 30 / …	Deterministic control
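To make the split column concrete, here is one way a percentage split could translate into block counts. It assumes each class receives its percentage of the pool's bytes and that every block costs its payload plus a 16-byte header; PMAD's actual rounding and layout rules may differ. `demo_class_cfg` and `demo_plan` are hypothetical names.

```c
#include <stddef.h>

#define DEMO_HEADER_BYTES 16u   /* per-block header size quoted in the text */

/* One row of a profile: payload size and share of the pool. */
typedef struct {
    size_t   block_size;   /* payload bytes per block */
    unsigned percent;      /* share of the pool, 0..100 */
} demo_class_cfg;

/* Writes the number of blocks per class into out[0..n-1].
   Returns 0 on success, -1 if the percentages do not sum to 100. */
static int demo_plan(size_t pool_bytes, const demo_class_cfg *cfg,
                     size_t n, size_t *out) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += cfg[i].percent;
    }
    if (total != 100) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        /* Divide first so very large pools cannot overflow size_t. */
        size_t share  = pool_bytes / 100 * cfg[i].percent;
        size_t stride = cfg[i].block_size + DEMO_HEADER_BYTES;
        out[i] = share / stride;
    }
    return 0;
}
```

Under these assumptions, the Balanced profile (`{64, 256, 1024}` at 60 / 30 / 10) on a 1 MiB pool yields 7863, 1156, and 100 blocks respectively.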

PERFORMANCE: What these numbers actually mean

Flat tail: PMAD moves only 2.5× from P50 to P99.9 (2.59 → 6.50 ns), the tightest spread of any allocator tested. jemalloc fans out 18.5× over the same range; the system allocator 15.3×. The remaining variance is OS scheduling noise, not allocator behaviour. Zero runtime syscalls means an allocation never enters the kernel, so your millionth pmad_alloc is as fast as your first.
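The spread figure is simply P99.9 divided by P50 (6.50 / 2.59 ≈ 2.5). For readers who want to compute the same ratio from their own latency samples, the sketch below uses the nearest-rank percentile method. That method is an assumption, since the benchmark's exact percentile calculation is not stated here, and the `demo_` helpers are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int demo_cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns the nearest-rank percentile.
   permille is the percentile in tenths of a percent (500 = P50,
   999 = P99.9); n must be greater than zero. */
static double demo_percentile(double *samples, size_t n, unsigned permille) {
    qsort(samples, n, sizeof *samples, demo_cmp_double);
    size_t rank = ((size_t)permille * n + 999) / 1000;   /* ceil(permille * n / 1000) */
    if (rank == 0) {
        rank = 1;
    }
    return samples[rank - 1];
}

/* Tail spread: how many times larger P99.9 is than P50. */
static double demo_tail_spread(double *samples, size_t n) {
    double p999 = demo_percentile(samples, n, 999);
    double p50  = demo_percentile(samples, n, 500);
    return p999 / p50;
}
```

A spread near 1 means the tail sits close to the median; larger values mean occasional allocations cost many times the typical one.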

Tear-down

A single munmap returns the entire pool to the OS in O(1) — there are no individual blocks to walk, no fragmented regions to compact. Shutdown is symmetric with startup: one syscall in, one syscall out.