Add vec0 optimize command: compact sparse chunks after deletions #269

Draft
asg017 wants to merge 1 commit into asg017/delete-clear-impl from asg017/optimize-impl

Conversation

asg017 commented Mar 3, 2026
Owner

Depends on #268

Adds an FTS5-style 'optimize' command to vec0 virtual tables.

Vectors in vec0 tables are stored next to each other in contiguous blocks. When a vector is deleted, its validity bit and data are zeroed out.

On future inserts, vec0 will sometimes refill that space with new data, but not always. As a result, a vec0 table can be larger than it needs to be, with "holes" scattered across its chunks.

The new 'optimize' command allows a developer to start an "optimize" procedure on a specific vec0 table. It will re-arrange vectors/chunks to compact space (a defragmenter, if you will).

I'm a little less sure of this change. Updating/deleting data is super important so I want to make sure I get this right.

Implements FTS5-style INSERT INTO v(v) VALUES ('optimize') command that
packs live entries from newer/sparser chunks into free slots of older
chunks, then deletes emptied chunks. Adds hidden command column to vtab
schema, command dispatcher in xUpdate, and two-pointer compaction
algorithm that handles vectors, all metadata types, and partitioned tables.
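To make the two-pointer idea concrete, here is a minimal in-memory sketch. It is not the code from this PR: the `Chunk` struct, `CHUNK_SIZE`, and every function name below are hypothetical, vectors are reduced to a single float, and metadata and partitions are ignored. One cursor scans forward through older chunks looking for free slots, the other scans backward through newer chunks looking for live entries, and they stop when they meet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 8 /* hypothetical: small enough for a one-byte validity bitmap */

/* Hypothetical in-memory stand-in for one vec0 chunk. */
typedef struct {
  uint8_t validity;            /* bit i set => slot i holds a live entry */
  int64_t rowids[CHUNK_SIZE];
  float vectors[CHUNK_SIZE];   /* 1-dimensional "vectors" keep the sketch short */
} Chunk;

static int slot_is_live(const Chunk *c, int slot) {
  return (c->validity >> slot) & 1;
}

/* Copy one live entry into a free slot, then zero the source slot. */
static void move_entry(Chunk *dst, int dst_slot, Chunk *src, int src_slot) {
  assert(!slot_is_live(dst, dst_slot) && slot_is_live(src, src_slot));
  dst->rowids[dst_slot] = src->rowids[src_slot];
  dst->vectors[dst_slot] = src->vectors[src_slot];
  dst->validity |= (uint8_t)(1u << dst_slot);
  src->rowids[src_slot] = 0;
  src->vectors[src_slot] = 0.0f;
  src->validity &= (uint8_t)~(1u << src_slot);
}

/* Two-pointer compaction. Returns how many chunks still hold live entries;
 * the rest are empty and could be deleted by the caller. */
int compact_chunks(Chunk *chunks, int n) {
  int lo = 0, lo_slot = 0;                  /* forward cursor: free slots */
  int hi = n - 1, hi_slot = CHUNK_SIZE - 1; /* backward cursor: live entries */
  while (lo < hi) {
    if (lo_slot == CHUNK_SIZE) { lo++; lo_slot = 0; continue; }
    if (hi_slot < 0) { hi--; hi_slot = CHUNK_SIZE - 1; continue; }
    if (slot_is_live(&chunks[lo], lo_slot)) { lo_slot++; continue; }
    if (!slot_is_live(&chunks[hi], hi_slot)) { hi_slot--; continue; }
    move_entry(&chunks[lo], lo_slot, &chunks[hi], hi_slot);
    lo_slot++;
    hi_slot--;
  }
  int live_chunks = 0;
  for (int i = 0; i < n; i++) {
    if (chunks[i].validity) live_chunks++;
  }
  return live_chunks;
}

/* Three sparse chunks holding 7 live entries in total fit into one chunk. */
int demo_compaction(void) {
  Chunk chunks[3];
  memset(chunks, 0, sizeof(chunks));
  chunks[0].validity = 0x55; /* slots 0, 2, 4, 6 live */
  chunks[1].validity = 0x03; /* slots 0, 1 live */
  chunks[2].validity = 0x08; /* slot 3 live */
  return compact_chunks(chunks, 3);
}
```

In this sketch the cursors never move entries within a single chunk, so the chunk where they meet may still contain holes. That is acceptable if the goal is only to reduce the number of chunks.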

Includes 16 Python tests, 7 C unit tests, and a libFuzzer target.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@stumpylog stumpylog left a comment

Hey @asg017 - happy to see this PR. We're integrating sqlite-vec into paperless-ngx as a local vector store for document embeddings. Documents get re-processed and deleted fairly often, so sparse chunks after deletions are a real concern for us - this optimize command is exactly what we need to keep the database file from growing unboundedly.

I used Claude Code to do a careful review of the diff and found four issues worth fixing before merge. Three are correctness bugs that would break tables with non-FLAT index types; the fourth is a minor memory leak under OOM. Detailed inline comments below.

Happy to help however I can to get this finished.

Comment thread sqlite-vec.c
i64 rowid = src->rowids[src_offset];

// 1. Move vector data for each vector column
for (int i = 0; i < p->numVectorColumns; i++) {

The vector loop here calls sqlite3_blob_open on shadowVectorChunksNames[i] for every vector column unconditionally. The _vector_chunksNN shadow table is only created for VEC0_INDEX_TYPE_FLAT columns — DiskANN, IVF, and RESCORE use different storage layouts and never get this table. Every other place in the codebase that accesses these blobs guards with:

if (p->vector_columns[i].index_type != VEC0_INDEX_TYPE_FLAT)
    continue;

Without that guard, calling optimize on a table with any non-FLAT column returns a SQL error and leaves the table partially modified (since some moves may have already committed). The same guard is needed in vec0_optimize_delete_chunk — see the comment there.
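To illustrate why the missing guard leaves things half-done, here is a small self-contained mock. None of these types are the real sqlite-vec definitions: the `mock_*` names are hypothetical, and `mock_open_vector_chunk` only simulates a blob open that fails when the column has no _vector_chunksNN table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in sqlite-vec.c. */
enum mock_index_type {
  MOCK_INDEX_FLAT,
  MOCK_INDEX_IVF,
  MOCK_INDEX_DISKANN,
  MOCK_INDEX_RESCORE
};

typedef struct {
  enum mock_index_type index_type;
} mock_vector_column;

typedef struct {
  int numVectorColumns;
  mock_vector_column vector_columns[4];
} mock_vtab;

/* Simulates opening a blob on _vector_chunksNN: succeeds (0) only for FLAT
 * columns, because only they are assumed to have that shadow table. */
static int mock_open_vector_chunk(const mock_vtab *p, int col) {
  return p->vector_columns[col].index_type == MOCK_INDEX_FLAT ? 0 : 1;
}

/* Unguarded loop: touches every column and bails out on the first failure,
 * after earlier columns have already been moved. */
int move_vectors_unguarded(const mock_vtab *p, int *moved) {
  *moved = 0;
  for (int i = 0; i < p->numVectorColumns; i++) {
    int rc = mock_open_vector_chunk(p, i);
    if (rc != 0) return rc;
    (*moved)++;
  }
  return 0;
}

/* Guarded loop: skips columns that have no _vector_chunksNN table. */
int move_vectors_guarded(const mock_vtab *p, int *moved) {
  *moved = 0;
  for (int i = 0; i < p->numVectorColumns; i++) {
    if (p->vector_columns[i].index_type != MOCK_INDEX_FLAT) continue;
    int rc = mock_open_vector_chunk(p, i);
    if (rc != 0) return rc;
    (*moved)++;
  }
  return 0;
}
```

With columns ordered FLAT, IVF, FLAT, the unguarded loop moves the first column and then fails, while the guarded loop moves both FLAT columns and returns success.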

Comment thread sqlite-vec.c
if (rc != SQLITE_DONE) return SQLITE_ERROR;

// Delete from each _vector_chunksNN
for (int i = 0; i < p->numVectorColumns; i++) {

Same non-FLAT guard issue as in vec0_optimize_move_entry: the vector loop here tries to DELETE FROM _vector_chunksNN for all columns, but that table only exists for FLAT index type columns.

Also: the canonical vec0Update_Delete_DeleteChunkIfEmpty includes a rescore_delete_chunk call that this function is missing:

#if SQLITE_VEC_ENABLE_RESCORE
    rc = rescore_delete_chunk(p, chunk_id);
    if (rc != SQLITE_OK) return rc;
#endif

Without it, _rescore_chunksNN rows for deleted chunks accumulate on every optimize call when rescore quantization is in use.

Since this function is essentially a variant of vec0Update_Delete_DeleteChunkIfEmpty, it might be worth factoring them into a shared helper to keep them in sync.

Comment thread sqlite-vec.c
// Read rowids blob
const void *rBlob = sqlite3_column_blob(stmtChunks, 2);
c->rowids_size = sqlite3_column_bytes(stmtChunks, 2);
c->rowids = sqlite3_malloc(c->rowids_size);

Small memory leak in this error path: if c->validity allocates successfully but c->rowids returns NULL, goto cleanup fires before nChunks++. The cleanup loop runs for (int i = 0; i < nChunks; i++) and never reaches chunks[nChunks].validity, which leaks.

Simplest fix is to move the increment up so it runs right after the rowids malloc and before its NULL check. sqlite3_free(NULL) is a no-op, so a half-initialized entry in the cleanup loop is safe:

c->validity = sqlite3_malloc(c->validity_size);
if (!c->validity) { rc = SQLITE_NOMEM; goto cleanup; }
memcpy(c->validity, vBlob, c->validity_size);

c->rowids_size = sqlite3_column_bytes(stmtChunks, 2);
c->rowids = sqlite3_malloc(c->rowids_size);
nChunks++;  // increment before check — cleanup will free validity even if rowids is NULL
if (!c->rowids) { rc = SQLITE_NOMEM; goto cleanup; }
memcpy(c->rowids, rBlob, c->rowids_size);
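For anyone who wants to convince themselves of the leak without building sqlite-vec, here is a rough standalone reproduction. It assumes nothing about the real code beyond the shape described above: `test_malloc`, `test_free`, `chunk_entry`, and `load_chunks` are all hypothetical, and the allocator simply fails on a chosen call so the OOM path can be exercised deterministically.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs = 0; /* allocations not yet freed */
static int alloc_calls = 0; /* total calls to test_malloc so far */
static int fail_at = -1;    /* 0-based call index that should fail; -1 = never */

/* Hypothetical allocator with fault injection. */
static void *test_malloc(size_t n) {
  if (alloc_calls++ == fail_at) return NULL;
  void *p = malloc(n);
  if (p) live_allocs++;
  return p;
}

/* Like sqlite3_free, freeing NULL is a no-op. */
static void test_free(void *p) {
  if (!p) return;
  live_allocs--;
  free(p);
}

typedef struct {
  void *validity;
  void *rowids;
} chunk_entry;

/* Loads `want` chunks, failing the allocation at index `fail_alloc_index`.
 * Returns the number of allocations leaked after cleanup. */
int load_chunks(int want, int fail_alloc_index, int increment_before_check) {
  chunk_entry chunks[8] = {0};
  int nChunks = 0;
  assert(want <= 8);
  live_allocs = 0;
  alloc_calls = 0;
  fail_at = fail_alloc_index;

  for (int k = 0; k < want; k++) {
    chunk_entry *c = &chunks[nChunks];
    c->validity = test_malloc(16);
    if (!c->validity) goto cleanup;
    c->rowids = test_malloc(64);
    if (increment_before_check) nChunks++; /* proposed ordering */
    if (!c->rowids) goto cleanup;
    if (!increment_before_check) nChunks++; /* original ordering */
  }

cleanup:
  /* Only entries below nChunks are visited, as in the loop under review. */
  for (int i = 0; i < nChunks; i++) {
    test_free(chunks[i].validity);
    test_free(chunks[i].rowids);
  }
  return live_allocs;
}
```

Calling `load_chunks(2, 3, 0)` fails the second chunk's rowids allocation with the original ordering and reports one leaked allocation; `load_chunks(2, 3, 1)` uses the proposed ordering and reports none.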
