The Makefile's OMIT_SIMD block greps the build machine's /proc/cpuinfo and, when the builder has AVX, compiles the whole translation unit with -mavx -mavx2. GitHub runners all have AVX2, so release artifacts inherit it. Visible on the published wheels via vec_debug():
- 0.1.9 linux x86_64: Build flags: (empty), so it runs on any x86_64.
- 0.1.10a4 linux x86_64: Build flags: avx rescore diskann, i.e. built with -mavx -mavx2 file-wide.
Reproduced under qemu-user: a basic insert + KNN workload on the 0.1.10a4 wheel dies with SIGILL under -cpu Westmere (SSE4.2, no AVX), while the 0.1.9 wheel passes the identical workload. Under -cpu SandyBridge (AVX, no AVX2) the basic workload happens to pass, but with -mavx2 applied file-wide the compiler is free to emit AVX2 in any function, so there is no guarantee for other code paths.
I'm a paperless-ngx maintainer, currently planning a move of its RAG vector store from LanceDB to sqlite-vec. A big part of the motivation is that 0.1.9's wheels run everywhere, while LanceDB's AVX2-baked wheels SIGILL on exactly this CPU class — Sandy/Ivy Bridge and Goldmont Atom/Celeron NAS boxes (paperless-ngx/paperless-ngx#12970). So I'd love for 0.1.10 final to keep 0.1.9's run-anywhere property.
The SIMD surface here looks small enough to have both speed and portability, and I'm happy to write the patch:
- __attribute__((target("avx"))) / __attribute__((target("avx2"))) on l2_sqr_float_avx and distance_hamming_avx2 only.
- __builtin_cpu_supports() checks at their two existing dispatch sites (cached in a static; libgcc accounts for OS XSAVE state).
- Makefile: drop -mavx -mavx2, define SQLITE_VEC_ENABLE_AVX unconditionally on x86_64 gcc/clang.
- Optionally: vec_debug() reports runtime-detected features (additive line), plus an env override to force the scalar paths for testing.
The dispatch cost is a cached static int read per distance call, so AVX2-host throughput should be unchanged; I'd verify that with benchmarks as part of the PR.
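For concreteness, here is a minimal sketch of the shape I have in mind, not the actual patch. The kernel body is a stand-in rather than sqlite-vec's real l2_sqr_float_avx, and the names SKETCH_HAVE_AVX_KERNEL, sketch_use_avx, l2_sqr_float_scalar, l2_sqr_float and the SKETCH_FORCE_SCALAR environment variable are all hypothetical. It assumes gcc or clang; on anything other than x86_64 it simply falls through to the scalar loop.

```c
#include <stddef.h>
#include <stdlib.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX_KERNEL 1 /* hypothetical macro, for this sketch only */
#endif

/* Portable fallback: squared L2 distance with no SIMD assumptions. */
static float l2_sqr_float_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

#ifdef SKETCH_HAVE_AVX_KERNEL
/* Only this function is compiled with AVX enabled; the rest of the
 * translation unit stays at the baseline ISA (no -mavx on the command line).
 * Stand-in body, not sqlite-vec's actual kernel. */
__attribute__((target("avx")))
static float l2_sqr_float_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 d = _mm256_sub_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += lanes[k];
    for (; i < n; i++) { /* scalar tail for n not divisible by 8 */
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
#endif

/* Returns 1 if the AVX kernel may be used, 0 otherwise. The answer is
 * computed once and cached in a plain static int (not atomic; concurrent
 * first calls would all compute the same value). */
static int sketch_use_avx(void) {
    static int cached = -1;
    if (cached < 0) {
        int ok = 0;
#ifdef SKETCH_HAVE_AVX_KERNEL
        ok = __builtin_cpu_supports("avx") ? 1 : 0;
#endif
        /* Hypothetical env override to force the scalar path in tests. */
        if (getenv("SKETCH_FORCE_SCALAR") != NULL) ok = 0;
        cached = ok;
    }
    return cached;
}

/* Dispatch site: one cached int read per call, then a direct call. */
static float l2_sqr_float(const float *a, const float *b, size_t n) {
    if (sketch_use_avx()) {
#ifdef SKETCH_HAVE_AVX_KERNEL
        return l2_sqr_float_avx(a, b, n);
#endif
    }
    return l2_sqr_float_scalar(a, b, n);
}
```

distance_hamming_avx2 would follow the same pattern with target("avx2") and a "avx2" feature check. Keeping the check behind a single helper also gives vec_debug() one place to read the runtime-detected state from.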
The zero-code alternative is dropping -mavx -mavx2 from release builds (ship 0.1.10 the way 0.1.9 was built), at the cost of dead SIMD kernels in the wheels. Either works for me — which direction do you prefer?
Related: #211 (no sdist on PyPI, so affected users can't easily build from source as a fallback).