Add vec0 optimize command: compact sparse chunks after deletions #269
asg017 wants to merge 1 commit into
Implements FTS5-style INSERT INTO v(v) VALUES ('optimize') command that
packs live entries from newer/sparser chunks into free slots of older
chunks, then deletes emptied chunks. Adds hidden command column to vtab
schema, command dispatcher in xUpdate, and two-pointer compaction
algorithm that handles vectors, all metadata types, and partitioned tables.
Includes 16 Python tests, 7 C unit tests, and a libFuzzer target.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
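For readers following along, the two-pointer compaction named in the commit message can be sketched roughly as below. This is a simplified in-memory model, not the PR's code: `DemoChunk`, `SLOTS_PER_CHUNK`, and `demo_compact` are hypothetical names, each chunk carries only a one-byte validity bitmap and a rowid array, and the real implementation also has to move vectors, metadata, and partition data through the shadow tables.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_PER_CHUNK 8 /* hypothetical: small enough for a one-byte bitmap */

typedef struct {
  uint8_t validity;                /* bit i set => slot i holds a live entry */
  int64_t rowids[SLOTS_PER_CHUNK]; /* rowid stored in each slot */
} DemoChunk;

/* Index of the lowest free slot, or -1 if the chunk is full. */
static int first_free_slot(uint8_t validity) {
  for (int i = 0; i < SLOTS_PER_CHUNK; i++)
    if (!(validity & (1u << i))) return i;
  return -1;
}

/* Index of the highest live slot, or -1 if the chunk is empty. */
static int last_live_slot(uint8_t validity) {
  for (int i = SLOTS_PER_CHUNK - 1; i >= 0; i--)
    if (validity & (1u << i)) return i;
  return -1;
}

/* Pack live entries from newer chunks (high index) into free slots of
 * older chunks (low index). Returns how many leading chunks still hold
 * data; every chunk at or past that index is empty and could be deleted. */
int demo_compact(DemoChunk *chunks, int nChunks) {
  assert(nChunks >= 0);
  int dst = 0, src = nChunks - 1;
  while (dst < src) {
    int free_slot = first_free_slot(chunks[dst].validity);
    if (free_slot < 0) { dst++; continue; } /* destination full */
    int live_slot = last_live_slot(chunks[src].validity);
    if (live_slot < 0) { src--; continue; } /* source drained */
    chunks[dst].rowids[free_slot] = chunks[src].rowids[live_slot];
    chunks[dst].validity |= (uint8_t)(1u << free_slot);
    chunks[src].rowids[live_slot] = 0;
    chunks[src].validity &= (uint8_t)~(1u << live_slot);
  }
  int remaining = nChunks;
  while (remaining > 0 && chunks[remaining - 1].validity == 0) remaining--;
  return remaining;
}
```

When the two pointers meet, every chunk before the meeting point is full and every chunk after it is empty, which is why a single trailing scan is enough to find the chunks that can be dropped.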
Hey @asg017 - happy to see this PR. We're integrating sqlite-vec into paperless-ngx as a local vector store for document embeddings. Documents get re-processed and deleted fairly often, so sparse chunks after deletions are a real concern for us - this optimize command is exactly what we need to keep the database file from growing unboundedly.
I used Claude Code to do a careful review of the diff and found four issues worth fixing before merge. Three are correctness bugs that would break tables with non-FLAT index types; the fourth is a minor memory leak under OOM. Detailed inline comments below.
Happy to help however I can to get this finished.
i64 rowid = src->rowids[src_offset];

// 1. Move vector data for each vector column
for (int i = 0; i < p->numVectorColumns; i++) {
The vector loop here calls sqlite3_blob_open on shadowVectorChunksNames[i] for every vector column unconditionally. The _vector_chunksNN shadow table is only created for VEC0_INDEX_TYPE_FLAT columns — DiskANN, IVF, and RESCORE use different storage layouts and never get this table. Every other place in the codebase that accesses these blobs guards with:
if (p->vector_columns[i].index_type != VEC0_INDEX_TYPE_FLAT)
  continue;
Without that guard, calling optimize on a table with any non-FLAT column returns a SQL error and leaves the table partially modified (since some moves may have already committed). The same guard is needed in vec0_optimize_delete_chunk; see the comment there.
if (rc != SQLITE_DONE) return SQLITE_ERROR;

// Delete from each _vector_chunksNN
for (int i = 0; i < p->numVectorColumns; i++) {
Same non-FLAT guard issue as in vec0_optimize_move_entry: the vector loop here tries to DELETE FROM _vector_chunksNN for all columns, but that table only exists for FLAT index type columns.
Also: the canonical vec0Update_Delete_DeleteChunkIfEmpty includes a rescore_delete_chunk call that this function is missing:
#if SQLITE_VEC_ENABLE_RESCORE
rc = rescore_delete_chunk(p, chunk_id);
if (rc != SQLITE_OK) return rc;
#endif
Without it, _rescore_chunksNN rows for deleted chunks accumulate on every optimize call when rescore quantization is in use.
Since this function is essentially a variant of vec0Update_Delete_DeleteChunkIfEmpty, it might be worth factoring them into a shared helper to keep them in sync.
// Read rowids blob
const void *rBlob = sqlite3_column_blob(stmtChunks, 2);
c->rowids_size = sqlite3_column_bytes(stmtChunks, 2);
c->rowids = sqlite3_malloc(c->rowids_size);
Small memory leak in this error path: if the c->validity allocation succeeds but the c->rowids allocation returns NULL, goto cleanup fires before nChunks++. The cleanup loop runs for (int i = 0; i < nChunks; i++) and never reaches chunks[nChunks].validity, so that buffer leaks.
Simplest fix is to move the increment to just before the NULL check on rowids. sqlite3_free(NULL) is a no-op, so a half-initialized entry in the cleanup loop is safe:
c->validity = sqlite3_malloc(c->validity_size);
if (!c->validity) { rc = SQLITE_NOMEM; goto cleanup; }
memcpy(c->validity, vBlob, c->validity_size);
c->rowids_size = sqlite3_column_bytes(stmtChunks, 2);
c->rowids = sqlite3_malloc(c->rowids_size);
nChunks++; // increment before check — cleanup will free validity even if rowids is NULL
if (!c->rowids) { rc = SQLITE_NOMEM; goto cleanup; }
memcpy(c->rowids, rBlob, c->rowids_size);
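To make the ordering argument concrete, here is a small standalone sketch of the same pattern. It is an illustration under stated assumptions rather than the PR's code: `demo_alloc`, `demo_free`, `demo_fail_at`, and `demo_load_chunks` are hypothetical stand-ins (plain `malloc`/`free` with a failure-injection hook instead of `sqlite3_malloc`/`sqlite3_free`), and the loader frees everything before returning so the leak counter can be checked.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
  unsigned char *validity;
  long long *rowids;
} DemoChunkBuf;

/* Hypothetical test hook: the Nth call to demo_alloc fails (0 = never). */
static int demo_fail_at = 0;
static int demo_alloc_calls = 0;
static int demo_live_allocs = 0; /* outstanding allocations */

static void *demo_alloc(size_t n) {
  if (++demo_alloc_calls == demo_fail_at) return NULL; /* simulated OOM */
  void *p = malloc(n);
  if (p) demo_live_allocs++;
  return p;
}

static void demo_free(void *p) { /* NULL is a no-op, like sqlite3_free */
  if (p) { free(p); demo_live_allocs--; }
}

/* Allocates buffers for `want` chunks, then releases them all.
 * Returns 0 on success, -1 if any allocation failed. */
int demo_load_chunks(int want, size_t validity_size, size_t rowids_size) {
  int rc = 0, nChunks = 0;
  assert(want > 0 && validity_size > 0 && rowids_size > 0);
  DemoChunkBuf *chunks = calloc((size_t)want, sizeof(*chunks));
  if (!chunks) return -1;
  for (int k = 0; k < want; k++) {
    DemoChunkBuf *c = &chunks[nChunks];
    c->validity = demo_alloc(validity_size);
    if (!c->validity) { rc = -1; goto cleanup; }
    c->rowids = demo_alloc(rowids_size);
    nChunks++; /* count the entry first so cleanup frees its validity */
    if (!c->rowids) { rc = -1; goto cleanup; }
  }
cleanup:
  for (int i = 0; i < nChunks; i++) {
    demo_free(chunks[i].validity);
    demo_free(chunks[i].rowids);
  }
  free(chunks);
  return rc;
}
```

Setting `demo_fail_at` to an even call number makes a rowids allocation fail; with the increment placed before the check, `demo_live_allocs` should return to zero on that path.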
Depends on #268
Adds an FTS5-style 'optimize' command to vec0 virtual tables.
Vectors in vec0 tables are stored next to each other in contiguous blocks. When vectors are deleted, their validity bit and data are zeroed out.
On future inserts, vec0 will sometimes refill that space with new data, but not always. As a result, a vec0 table can be larger than it needs to be, with "holes" across various chunks.
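As a rough illustration of what those "holes" cost, the sketch below counts live slots across a set of validity bitmaps and computes the fewest chunks that could hold them. The names `demo_count_live` and `demo_min_chunks` are hypothetical, and the layout (one bit per slot, a slot count that is a multiple of 8) is an assumption made for the example, not a statement about vec0's on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits (live slots) in a validity bitmap of nbytes bytes. */
static size_t demo_count_live(const uint8_t *validity, size_t nbytes) {
  size_t live = 0;
  for (size_t i = 0; i < nbytes; i++)
    for (uint8_t b = validity[i]; b; b &= (uint8_t)(b - 1)) live++;
  return live;
}

/* Fewest chunks that could hold every live entry if perfectly packed. */
size_t demo_min_chunks(const uint8_t *const *bitmaps, size_t nChunks,
                       size_t slots_per_chunk) {
  assert(slots_per_chunk > 0 && slots_per_chunk % 8 == 0);
  size_t live = 0;
  for (size_t i = 0; i < nChunks; i++)
    live += demo_count_live(bitmaps[i], slots_per_chunk / 8);
  return (live + slots_per_chunk - 1) / slots_per_chunk; /* ceiling */
}
```

Comparing the result with the number of chunks actually allocated gives a simple measure of how much space a compaction pass could reclaim.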
The new 'optimize' command allows a developer to start an "optimize" procedure on a specific vec0 table. It will re-arrange vectors/chunks to compact space (a defragmenter, if you will).
I'm a little less sure of this change. Updating/deleting data is super important so I want to make sure I get this right.