docs(storage): document storage.rocks.* memory config options #496
Open: kriszyp wants to merge 3 commits into `main` from `docs/rocks-memory-config`
+107
−0
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -138,6 +138,109 @@ Default: `true` | |||||
|
|
||||||
| In-memory record caching of decoded records. Disable to reduce heap usage when records are large and unlikely to be re-read in the same process. | ||||||
|
|
||||||
| ## RocksDB Memory | ||||||
|
|
||||||
| RocksDB exposes two large native memory pools that Harper makes tunable: a shared **block cache** for hot SST blocks, and a **WriteBufferManager** (enabled by default) that caps total memtable memory across every database in the process. These options apply only when `storage.engine` is `rocksdb`. | ||||||
|
Member
I believe the comma is unnecessary.
Suggested change
|
||||||
|
|
||||||
| ### How RocksDB reads are cached | ||||||
|
|
||||||
| A read of a record that isn't in the memtable goes through three tiers before reaching disk: | ||||||
|
|
||||||
| 1. **Block cache** (in-process, decompressed) — sized by `storage.rocks.blockCacheSize`. A hit returns in roughly a microsecond with no syscall and no decompression cost. | ||||||
| 2. **OS page cache** (kernel, compressed SST file pages) — sized dynamically by the kernel from whatever memory isn't claimed by the process. A block-cache miss that hits the page cache costs a `read` syscall plus decompression — still on the order of microseconds, just an order of magnitude slower than the block cache. | ||||||
| 3. **Disk** — if neither cache holds the page, RocksDB reads from the SST file directly. | ||||||
|
|
||||||
| Harper uses buffered I/O, so the OS page cache is always in play. The implication for sizing: shrinking the block cache doesn't directly translate to more disk reads — it shifts hits from the block cache (decompressed) to the OS page cache (compressed). The OS page cache also adjusts dynamically to host-wide memory pressure, which the block cache does not. Reserving less memory for the block cache leaves more for the page cache and for unrelated allocations on the host. | ||||||
|
|
||||||
| The trade-off favors a larger block cache when read latency matters and the working set fits; it favors a smaller block cache when memory pressure or noisy neighbors are the dominant concern. | ||||||
|
|
||||||
| ### `storage.rocks.blockCacheSize` | ||||||
|
|
||||||
| <VersionBadge version="v5.1.0" /> | ||||||
|
|
||||||
| Type: `number` (bytes) | ||||||
|
|
||||||
| Default: 25% of constrained (cgroup) or total memory | ||||||
|
|
||||||
| The shared LRU cache for decompressed SST blocks. Every RocksDB database in the process draws from this single pool. | ||||||
|
|
||||||
| The cache fills as blocks are read; it does **not** shrink on idle. Once the cache reaches its high-water mark for a workload, entries persist until LRU eviction or a manual capacity change. A long-running instance with a brief burst of activity will keep holding the cached blocks until they are evicted or the process exits. | ||||||
|
|
||||||
| ```yaml | ||||||
| storage: | ||||||
|   rocks: | ||||||
|     blockCacheSize: 268435456 # 256 MB | ||||||
| ``` | ||||||
|
|
||||||
| Lower the cache size when: | ||||||
|
|
||||||
| - The host has limited memory headroom and the OS page cache is a meaningful second tier. | ||||||
| - Read access patterns favor a warm working set far smaller than 25% of memory. | ||||||
| - The instance runs under a strict cgroup limit and the headroom is needed for memtables or application heap. | ||||||
|
|
||||||
| Raise it (or leave at the default) when reads dominate and the working set comfortably fits at 25%. | ||||||
|
|
||||||
| ### `storage.rocks.writeBufferManagerSize` | ||||||
|
|
||||||
| <VersionBadge version="v5.1.0" /> | ||||||
|
|
||||||
| Type: `number` (bytes) | ||||||
|
|
||||||
| Default: one third of `blockCacheSize` (enabled). Set to `0` to disable. | ||||||
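As a worked illustration of the two defaults, consider a hypothetical host with 16 GiB of memory and no cgroup limit. The fragment below spells out roughly what the defaults would amount to if written explicitly. The host size is an assumption, and how Harper rounds the one-third value is not stated here, so treat the second figure as approximate.

```yaml
# Hypothetical 16 GiB host (17179869184 bytes), no cgroup limit assumed.
storage:
  rocks:
    # 25% of 17179869184 bytes = 4294967296 bytes (4 GiB)
    blockCacheSize: 4294967296
    # one third of blockCacheSize, roughly 1431655765 bytes (about 1.33 GiB);
    # the exact rounding is an assumption
    writeBufferManagerSize: 1431655765
```

Writing the values out this way is only useful for reasoning about the budget; leaving both options unset lets Harper derive them from the memory actually available.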
|
|
||||||
| Harper attaches a single RocksDB `WriteBufferManager` to every opened database in the process. Total memtable memory is capped at this size across the entire process. The cap covers active memtables, immutable memtables awaiting flush, and the history of recently flushed memtables that RocksDB's OptimisticTransactionDB retains for conflict checking. | ||||||
|
Member
Consider using monospace font for
Suggested change
|
||||||
|
|
||||||
| Without a `WriteBufferManager`, each column family (table) manages its own memtable budget. The total grows with the number of column families: each one retains roughly `max_write_buffer_size_to_maintain` worth of recently flushed memtables for snapshot reads and conflict detection. A database with many tables can accumulate hundreds of megabytes to a few gigabytes of resident anonymous memory before any cap is reached. The manager is enabled by default to bound that growth at a single limit. | ||||||
|
|
||||||
| Override the default to set an explicit budget, or disable it entirely: | ||||||
|
|
||||||
| ```yaml | ||||||
| storage: | ||||||
|   rocks: | ||||||
|     writeBufferManagerSize: 268435456 # 256 MB total memtable budget (0 disables) | ||||||
| ``` | ||||||
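For completeness, a minimal sketch of the disabled case, using only the `0` value described above. With this setting each column family would go back to managing its own memtable budget, so total memtable memory is no longer bounded by a single process-wide limit.

```yaml
storage:
  rocks:
    # 0 disables the WriteBufferManager; memtable memory is then
    # budgeted per column family rather than capped process-wide
    writeBufferManagerSize: 0
```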
|
|
||||||
| The configured size affects new databases opened after it is changed; existing open databases retain whatever budget they were attached with. | ||||||
|
|
||||||
| ### `storage.rocks.writeBufferManagerCostToCache` | ||||||
|
|
||||||
| <VersionBadge version="v5.1.0" /> | ||||||
|
|
||||||
| Type: `boolean` | ||||||
|
|
||||||
| Default: `true` | ||||||
|
|
||||||
| When `true`, memtable memory tracked by the `WriteBufferManager` is **charged against the block cache**: it counts toward the cache's capacity as pinned cache entries. The block cache and write buffers then share a single accounting pool, visible through one operational metric (`rocksdb.block-cache-usage`). | ||||||
|
|
||||||
| This does not let the cache "shrink" to make room for writes — pinned entries cannot be evicted by LRU — but it unifies observability and bounds the combined memory footprint when `writeBufferManagerSize` is at or below `blockCacheSize`. | ||||||
|
|
||||||
| This option has no effect when `storage.rocks.writeBufferManagerSize` is `0` or when the block cache is disabled. | ||||||
|
|
||||||
| ```yaml | ||||||
| storage: | ||||||
|   rocks: | ||||||
|     blockCacheSize: 536870912 # 512 MB | ||||||
|     writeBufferManagerSize: 268435456 # 256 MB | ||||||
|     writeBufferManagerCostToCache: true | ||||||
| ``` | ||||||
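The opposite setting can be sketched the same way. With `writeBufferManagerCostToCache: false`, memtable memory would presumably be accounted for separately from the block cache, so the two budgets add up rather than overlap. The combined figure in the comment is an inference from the description above, not a documented guarantee.

```yaml
storage:
  rocks:
    blockCacheSize: 536870912 # 512 MB
    writeBufferManagerSize: 268435456 # 256 MB
    # memtables are not counted against the block cache here; assumed
    # combined budget is about 512 MB + 256 MB = 768 MB
    writeBufferManagerCostToCache: false
```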
|
|
||||||
| ### `storage.rocks.writeBufferManagerAllowStall` | ||||||
|
|
||||||
| <VersionBadge version="v5.1.0" /> | ||||||
|
|
||||||
| Type: `boolean` | ||||||
|
|
||||||
| Default: `true` | ||||||
|
|
||||||
| Controls behavior when memtable memory reaches `writeBufferManagerSize`: | ||||||
|
|
||||||
| - `false` (soft cap) — Memtables may briefly exceed the limit. RocksDB compensates by flushing more aggressively. Writes proceed without latency spikes; total memory may temporarily overshoot during bursts. | ||||||
| - `true` (hard cap) — Writes are stalled until flushes free up memory. Total memtable memory is strictly bounded; write latency can spike during bursts. | ||||||
|
|
||||||
| The default (`true`) strictly bounds total memtable memory by applying write backpressure rather than letting memtables overshoot. This also keeps bulk ingest from outrunning the memtable flush/conflict-check window. Set to `false` for a soft cap when write-latency smoothness matters more than a strict memory bound and brief overshoot during bursts is acceptable. | ||||||
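A minimal sketch of the soft-cap configuration described above, reusing the 256 MB budget from the earlier examples. The budget value is illustrative only.

```yaml
storage:
  rocks:
    writeBufferManagerSize: 268435456 # 256 MB target for total memtable memory
    # soft cap: writes are not stalled at the limit, so memtable memory
    # may temporarily exceed 256 MB during write bursts
    writeBufferManagerAllowStall: false
```

This trades a strict memory bound for smoother write latency, so it suits hosts that have some headroom above the configured budget.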
|
|
||||||
| This option is the only `WriteBufferManager` setting that can be changed at runtime; `writeBufferManagerCostToCache` is fixed at first creation. | ||||||
|
|
||||||
| ## Storage Reclamation | ||||||
|
|
||||||
| `storage.reclamation` controls how Harper evicts data from caching tables (tables with [`sourcedFrom`](../resources/resource-api.md#sourcedfromresource-options)) when disk usage runs high. Reclamation does **not** affect non-caching tables — those rely on explicit deletion, TTL expiration, or [compaction](./compaction.md). | ||||||
|
|
||||||
The word "charge" is kinda confusing. It stems from the description in rocksdb-js, which is also confusing.