A Query.Farm VGI worker for DuckDB.
vgi-match · a Query.Farm VGI worker · powered by Splink
A VGI worker that brings probabilistic entity
resolution — record linkage and deduplication — to DuckDB/SQL, backed by
Splink (Ministry of
Justice, UK; MIT licensed). Hand it a relation of records and it returns the
same rows with an appended cluster_id: rows sharing a cluster_id are the
same real-world entity.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'match' (TYPE vgi, LOCATION 'uv run match_worker.py');
-- Dedup customers on name + email. Rows sharing cluster_id are the same person.
SELECT * FROM match.match_resolve(
(SELECT * FROM customers),
columns := 'first_name,last_name,email'
) ORDER BY cluster_id;
-- How many distinct real entities are in the relation?
SELECT count(DISTINCT cluster_id)
FROM match.match_resolve((SELECT * FROM customers),
columns := 'first_name,last_name,email');match_resolve is a table function that consumes a whole input relation —
passed as a single (SELECT ...) subquery (the positional argument) — and emits
the input rows unchanged, with two appended columns:
| appended column | meaning |
|---|---|
cluster_id (VARCHAR) |
resolved-entity id — rows sharing it are the same entity |
match_probability (DOUBLE) |
strongest pairwise match probability linking the row into its cluster (1.0 for a singleton) |
The linkage config is passed as named arguments:
| named arg | meaning |
|---|---|
columns := 'c1,c2,...' |
comma-separated comparison columns to match records on (required) |
threshold := 0.9 |
match-probability threshold in [0, 1] for linking pairs / forming clusters (default 0.5) |
Every column not named in columns is passed straight through to the
output untouched — so you can SELECT * your relation in and keep all your
fields. The comparison columns are also passed through; they just additionally
drive the matching.
Because entity resolution is an all-pairs whole-relation problem (every
record is conceptually compared against every other), match_resolve is a
buffering (Sink+Source) function: it buffers all input batches, then runs
Splink once over the full relation.
Splink makes all-pairs matching tractable with three ideas:
- Blocking — only compare record pairs that agree on at least one blocking rule (here: agree on any one of the comparison columns), so we never materialize the full O(n²) cross product.
- Comparisons — for each candidate pair, each comparison column is scored
into agreement levels using string similarity. By default
vgi-matchuses a fuzzy Jaro-WinklerNameComparisonfor name-like columns (first_name,last_name,surname, …) and a fuzzy Levenshtein comparison (≤1,≤2edits, exact) for everything else — so typos and nicknames still match. - The Fellegi-Sunter model — a calibrated
match_probabilityis computed per pair from per-levelm(probability of that agreement level given a true match) andu(given a non-match) probabilities. Pairs abovethresholdare linked, and the connected components of the resulting match graph become the entity clusters.
u is learned by random sampling; m and the overall match rate are learned by
unsupervised Expectation-Maximisation (EM) on the buffered data — no
labels required. This is the default path: supply only columns and the
worker builds a sensible default model and trains it on your data. (Splink emits
warnings on very small inputs where some agreement levels are never observed;
those parameters fall back to defaults and clustering still works.)
| function | returns | does |
|---|---|---|
match_resolve(rel, columns, [threshold]) |
every input column + cluster_id + match_probability, one row per input row |
dedup / resolve entities in rel |
# From a source checkout (stdio worker, exactly how DuckDB drives it after ATTACH):
uv run match_worker.py
# Or install the console script:
uv tool install vgi-match # provides `vgi-match` (stdio) and `vgi-match-http`Splink spins up its own DuckDB engine inside this worker process. Rows
arrive over Arrow from the caller's DuckDB, are buffered into a single pandas
frame, and are then handed to Splink's DuckDB to run the linkage. So this is
buffer-all-then-compute: peak memory in the worker holds the whole input
relation, plus Splink's intermediate blocked-pairs tables. That is exactly the
right shape for the small/medium relations entity resolution usually runs on
(customer lists, registries, product catalogs), but it is not a streaming
operator — it is not free on very large relations. Block tightly (use selective
columns) and resolve in partitions if you must scale up.
v1 ships the batch resolve path: train-on-this-data-then-cluster, in one
call. The linkage core (vgi_match/linkage.py) is already designed so a
pre-trained Splink settings object can be loaded and reused instead of
training — resolve(..., settings_json=...) consumes a settings dict/JSON
as-is and skips EM. A single-record match_score serving function (score one
new record against a previously-trained model without re-clustering the whole
relation) is a planned extension, not yet exposed as a SQL function — see
CLAUDE.md for the honest gap list.
vgi-match is MIT licensed (see LICENSE). Its dependencies:
| dependency | license | role |
|---|---|---|
| Splink | MIT | the entity-resolution engine |
| DuckDB (Splink's backend + the caller) | MIT | runs the linkage SQL |
| pandas | BSD-3-Clause | the buffered frame |
| pyarrow | Apache-2.0 | Arrow transport |
| Splink transitives (sqlglot MIT, igraph GPL-2.0, jsonschema MIT, numpy/altair BSD) | mixed permissive | SQL generation, graph clustering, validation |
Note on
igraph: Splink depends onpython-igraph, which is GPL-2.0. It is used at runtime (for connected-components clustering) and is not statically linked intovgi-match, but downstream redistributors who bundle the full dependency closure should be aware of it. Everythingvgi-matchitself ships is MIT.
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
