4 changes: 4 additions & 0 deletions reference/configuration/options.md
@@ -239,6 +239,10 @@ storage:
- `reclamation.threshold` — Free-space ratio below which reclamation begins evicting from caching tables; _Default_: `0.4` (Added in: v4.5.0)
- `reclamation.interval` — Free-space check interval; _Default_: `1h`
- `reclamation.evictionFactor` — Heuristic factor for early eviction under disk pressure; _Default_: `100000`. See [Storage Tuning — Reclamation](../database/storage-tuning.md#storage-reclamation)
- `rocks.blockCacheSize` — RocksDB shared block cache size in bytes; _Default_: 25% of constrained memory. See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
- `rocks.writeBufferManagerSize` — Process-wide cap (bytes) on RocksDB memtable memory across all databases. `0` disables; _Default_: one third of `blockCacheSize` (enabled). See [Storage Tuning — RocksDB Memory](../database/storage-tuning.md#rocksdb-memory) (Added in: v5.1.0)
- `rocks.writeBufferManagerCostToCache` — Charge memtable memory against the block cache so both share a single accounting pool; _Default_: `true`. Has no effect when `writeBufferManagerSize` is `0`. (Added in: v5.1.0)

Review comment (Member):

The word "charge" is kinda confusing. It stems from the description in rocksdb-js, which is also confusing.

Suggested change
- `rocks.writeBufferManagerCostToCache` — Charge memtable memory against the block cache so both share a single accounting pool; _Default_: `true`. Has no effect when `writeBufferManagerSize` is `0`. (Added in: v5.1.0)
- `rocks.writeBufferManagerCostToCache` — When enabled, memtable memory and the block cache share a single, unified memory pool. During heavy write bursts, the block cache dynamically shrinks to give memtables more room. Once those memtables flush to disk, the block cache automatically reclaims that memory space. _Default_: `true`. Has no effect when `writeBufferManagerSize` is `0`. (Added in: v5.1.0)

- `rocks.writeBufferManagerAllowStall` — Stall writes when memtable memory exceeds `writeBufferManagerSize` (hard cap) instead of allowing brief overshoot with more aggressive flushing (soft cap); _Default_: `true`. (Added in: v5.1.0)

---

103 changes: 103 additions & 0 deletions reference/database/storage-tuning.md
@@ -138,6 +138,109 @@ Default: `true`

In-memory record caching of decoded records. Disable to reduce heap usage when records are large and unlikely to be re-read in the same process.

## RocksDB Memory

RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks, and a **WriteBufferManager** (enabled by default) that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`.

Review comment (Member):

I believe the comma is unnecessary.

Suggested change
RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks, and a **WriteBufferManager** (enabled by default) that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`.
RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks and a **WriteBufferManager** (enabled by default) that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`.


### How RocksDB reads are cached

A read of a record that isn't in the memtable goes through three tiers before reaching disk:

1. **Block cache** (in-process, decompressed) — sized by `storage.rocks.blockCacheSize`. A hit returns in roughly a microsecond with no syscall and no decompression cost.
2. **OS page cache** (kernel, compressed SST file pages) — sized dynamically by the kernel from whatever memory isn't claimed by the process. A block-cache miss that hits the page cache costs a `read` syscall plus decompression — still on the order of microseconds, just an order of magnitude slower than the block cache.
3. **Disk** — if neither cache holds the page, RocksDB reads from the SST file directly.

Harper uses buffered I/O, so the OS page cache is always in play. The implication for sizing: shrinking the block cache doesn't directly translate to more disk reads — it shifts hits from the block cache (decompressed) to the OS page cache (compressed). The OS page cache also adjusts dynamically to host-wide memory pressure, which the block cache does not. Reserving less memory for the block cache leaves more for the page cache and for unrelated allocations on the host.

The trade-off favors a larger block cache when read latency matters and the working set fits; it favors a smaller block cache when memory pressure or noisy neighbors are the dominant concern.

### `storage.rocks.blockCacheSize`

<VersionBadge version="v5.1.0" />

Type: `number` (bytes)

Default: 25% of constrained (cgroup) or total memory

The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool.

The cache fills as blocks are read; it does **not** shrink on idle. Once the cache reaches its high-water mark for a workload, entries persist until LRU eviction or a manual capacity change. A long-running instance with a brief burst of activity will hold the cached blocks for the lifetime of the process.

```yaml
storage:
  rocks:
    blockCacheSize: 268435456 # 256 MB
```

Lower the cache size when:

- The host has limited memory headroom and the OS page cache is a meaningful second tier.
- Read access patterns favor a warm working set far smaller than 25% of memory.
- The instance runs under a strict cgroup limit and the headroom is needed for memtables or application heap.

Raise it (or leave at the default) when reads dominate and the working set comfortably fits at 25%.
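
As a worked illustration of the sizing guidance above, assume a hypothetical container with a 4 GiB cgroup memory limit. The limit and the chosen override are illustrative assumptions, not recommendations. The default would be 25% of 4 GiB, or 1 GiB; the sketch below halves that to leave more room for the OS page cache and the application heap.

```yaml
# Hypothetical host: 4 GiB cgroup limit (4294967296 bytes).
# The computed default block cache would be 25% of that: 1073741824 bytes (1 GiB).
storage:
  rocks:
    blockCacheSize: 536870912 # 512 MiB, half of the computed default
```

If the default memtable budget follows the configured cache size (it is described as one third of `blockCacheSize`), lowering the cache may lower that budget too. Setting `writeBufferManagerSize` explicitly avoids depending on that assumption.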

### `storage.rocks.writeBufferManagerSize`

<VersionBadge version="v5.1.0" />

Type: `number` (bytes)

Default: one third of `blockCacheSize` (enabled). Set to `0` to disable.

Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history that RocksDB's OptimisticTransactionDB retains for conflict checking — is capped at this size across the entire process.

Review comment (Member):

Consider using monospace font for OptimisticTransactionDB.

Suggested change
Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history that RocksDB's OptimisticTransactionDB retains for conflict checking — is capped at this size across the entire process.
Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory — including active memtables, immutable memtables awaiting flush, and the maintain-window history that RocksDB's `OptimisticTransactionDB` retains for conflict checking — is capped at this size across the entire process.


Without a `WriteBufferManager`, each column family (table) manages its own memtable budget. The total grows with the number of column families: each one retains roughly `max_write_buffer_size_to_maintain` worth of recently flushed memtables for snapshot reads and conflict detection. A database with many tables can accumulate hundreds of megabytes to a few gigabytes of resident anonymous memory before any cap is reached. The manager is enabled by default to bound that growth at a single limit.

Override the default to set an explicit budget, or disable it entirely:

```yaml
storage:
  rocks:
    writeBufferManagerSize: 268435456 # 256 MB total memtable budget (0 disables)
```

The configured size affects new databases opened after it is changed; existing open databases retain whatever budget they were attached with.
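
For completeness, a minimal sketch of the disabled form, using only the `0` value documented above. With the manager off, each column family goes back to managing its own memtable budget, and per the option descriptions `writeBufferManagerCostToCache` then has no effect.

```yaml
storage:
  rocks:
    writeBufferManagerSize: 0 # disables the process-wide memtable cap
```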

### `storage.rocks.writeBufferManagerCostToCache`

<VersionBadge version="v5.1.0" />

Type: `boolean`

Default: `true`

When `true`, memtable memory tracked by the `WriteBufferManager` is **charged against the block cache** as pinned cache entries. The block cache and write buffers then share a single accounting pool, visible through one operational metric (`rocksdb.block-cache-usage`).

This does not let the cache "shrink" to make room for writes — pinned entries cannot be evicted by LRU — but it unifies observability and bounds the combined memory footprint when `writeBufferManagerSize` is at or below `blockCacheSize`.

Has no effect when `storage.rocks.writeBufferManagerSize` is `0` or when the block cache is disabled.

```yaml
storage:
  rocks:
    blockCacheSize: 536870912 # 512 MB
    writeBufferManagerSize: 268435456 # 256 MB
    writeBufferManagerCostToCache: true
```

### `storage.rocks.writeBufferManagerAllowStall`

<VersionBadge version="v5.1.0" />

Type: `boolean`

Default: `true`

Controls behavior when memtable memory reaches `writeBufferManagerSize`:

- `false` (soft cap) — Memtables may briefly exceed the limit. RocksDB compensates by flushing more aggressively. Writes proceed without latency spikes; total memory may temporarily overshoot during bursts.
- `true` (hard cap) — Writes are stalled until flushes free up memory. Total memtable memory is strictly bounded; write latency can spike during bursts.

The default (`true`) strictly bounds total memtable memory, applying write backpressure rather than letting memtables overshoot — which also keeps bulk ingest from outrunning the memtable flush/conflict-check window. Set to `false` for a soft cap when write-latency smoothness matters more than a strict memory bound and brief overshoot during bursts is acceptable.

This option is the only `WriteBufferManager` setting that can be changed at runtime — `costToCache` is fixed at first creation.
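
A minimal sketch of the soft-cap configuration described above, for a deployment that prefers smooth write latency over a strict memory bound. The 256 MB budget is an illustrative value, not a recommendation.

```yaml
storage:
  rocks:
    writeBufferManagerSize: 268435456 # 256 MB target; may be briefly exceeded during bursts
    writeBufferManagerAllowStall: false # soft cap: flush more aggressively instead of stalling writes
```

With this setting, leave some headroom above the configured budget when sizing the host or container, since memtable memory can temporarily overshoot.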

## Storage Reclamation

`storage.reclamation` controls how Harper evicts data from caching tables (tables with [`sourcedFrom`](../resources/resource-api.md#sourcedfromresource-options)) when disk usage runs high. Reclamation does **not** affect non-caching tables — those rely on explicit deletion, TTL expiration, or [compaction](./compaction.md).