
Linux/macOS builds bake AVX/AVX2 based on the build machine's CPU, so 0.1.10-alpha wheels SIGILL on older x86_64 #302

@stumpylog


The Makefile's OMIT_SIMD block greps the build machine's /proc/cpuinfo and, when the builder has AVX, compiles the whole translation unit with -mavx -mavx2. GitHub runners all have AVX2, so release artifacts inherit it. Visible on the published wheels via vec_debug():

  • 0.1.9 linux x86_64: Build flags: empty — runs on any x86_64.
  • 0.1.10a4 linux x86_64: Build flags: avx rescore diskann — built with -mavx -mavx2 file-wide.

Reproduced under qemu-user: a basic insert + KNN workload on the 0.1.10a4 wheel dies with SIGILL under -cpu Westmere (SSE4.2, no AVX), while the 0.1.9 wheel passes the identical workload. Under -cpu SandyBridge (AVX, no AVX2) the basic workload happens to pass, but with -mavx2 applied file-wide the compiler is free to emit AVX2 in any function, so there is no guarantee for other code paths.

I'm a paperless-ngx maintainer, currently planning a move of its RAG vector store from LanceDB to sqlite-vec. A big part of the motivation is that 0.1.9's wheels run everywhere, while LanceDB's AVX2-baked wheels SIGILL on exactly this CPU class — Sandy/Ivy Bridge and Goldmont Atom/Celeron NAS boxes (paperless-ngx/paperless-ngx#12970). So I'd love for 0.1.10 final to keep 0.1.9's run-anywhere property.

The SIMD surface here looks small enough to have both speed and portability, and I'm happy to write the patch:

  1. __attribute__((target("avx"))) on l2_sqr_float_avx and __attribute__((target("avx2"))) on distance_hamming_avx2 only, so no other function is compiled with AVX enabled.
  2. __builtin_cpu_supports() checks at their two existing dispatch sites (cached in a static; libgcc accounts for OS XSAVE state).
  3. Makefile: drop -mavx -mavx2, define SQLITE_VEC_ENABLE_AVX unconditionally on x86_64 gcc/clang.
  4. Optionally: vec_debug() reports runtime-detected features (additive line), plus an env override to force the scalar paths for testing.
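To make steps 1, 2 and 4 concrete, here is a rough sketch of the pattern I have in mind, not the actual patch. Everything prefixed `demo_` and the `DEMO_FORCE_SCALAR` variable are hypothetical names for illustration; the real change would touch l2_sqr_float_avx and its existing dispatch site instead. It assumes gcc or clang on x86_64 and falls back to the scalar path everywhere else.

```c
#include <stddef.h>
#include <stdlib.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define DEMO_HAVE_X86_DISPATCH 1
#endif

/* Portable fallback: squared L2 distance, baseline x86_64 code only. */
static float demo_l2_sqr_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

#ifdef DEMO_HAVE_X86_DISPATCH
/* Only this function is compiled with AVX enabled; the rest of the
 * translation unit keeps the baseline instruction set. */
__attribute__((target("avx")))
static float demo_l2_sqr_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 d = _mm256_sub_ps(_mm256_loadu_ps(a + i),
                                 _mm256_loadu_ps(b + i));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
    }
    /* Horizontal sum of the 8 lanes, then the scalar tail. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) {
        sum += lanes[k];
    }
    for (; i < n; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Runtime check, evaluated once and cached in a static.
 * DEMO_FORCE_SCALAR=1 is a hypothetical test override.
 * Note: the unsynchronized static is a simplification for this sketch. */
static int demo_use_avx(void) {
    static int cached = -1;
    if (cached < 0) {
        int ok = 0;
        const char *force = getenv("DEMO_FORCE_SCALAR");
        if (!(force && force[0] == '1')) {
            __builtin_cpu_init();
            ok = __builtin_cpu_supports("avx") ? 1 : 0;
        }
        cached = ok;
    }
    return cached;
}
#endif

/* Dispatch site: one cached int read per call on x86_64. */
static float demo_l2_sqr(const float *a, const float *b, size_t n) {
#ifdef DEMO_HAVE_X86_DISPATCH
    if (demo_use_avx()) {
        return demo_l2_sqr_avx(a, b, n);
    }
#endif
    return demo_l2_sqr_scalar(a, b, n);
}
```

Because the AVX kernel has a different target than its caller, the compiler should not inline it into the dispatcher, which is what keeps AVX instructions out of the baseline path. The AVX2 Hamming kernel would follow the same shape with target("avx2") and __builtin_cpu_supports("avx2").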

The dispatch cost is a cached static int read per distance call, so AVX2-host throughput should be unchanged; I'd verify that with benchmarks as part of the PR.

The zero-code alternative is dropping -mavx -mavx2 from release builds (ship 0.1.10 the way 0.1.9 was built), at the cost of dead SIMD kernels in the wheels. Either works for me — which direction do you prefer?

Related: #211 (no sdist on PyPI, so affected users can't easily build from source as a fallback).
