Hi, I think there is a bug in the AVX2 popcount implementation in the Shuffles and Popcount section.
Current code:
for (; k + 15 < N; k += 15) {
reg s = _mm256_setzero_si256();
for (int i = 0; i < 15; i += 8) {
reg x = _mm256_load_si256( (reg*) &a[k + i] );
...
}
...
}
Here, k indexes int elements, while one AVX2 register processes 8 ints.
The inner loop loads from:
a[k + 0] ... a[k + 7]
a[k + 8] ... a[k + 15]
so it processes 16 elements, but the outer loop only advances by 15:
This causes overlapping between chunks, e.g. element 15 can be counted twice.
I think the intended version should process 15 AVX2 registers per outer iteration, i.e. 15 * 8 integers:
int popcnt() {
int k = 0;
reg t = _mm256_setzero_si256();
for (; k + 15 * 8 <= N; k += 15 * 8) {
reg s = _mm256_setzero_si256();
for (int i = 0; i < 15; i++) {
reg x = _mm256_load_si256( (reg*) &a[k + i * 8] );
reg l = _mm256_and_si256(x, low_mask);
reg h = _mm256_and_si256(_mm256_srli_epi16(x, 4), low_mask);
reg pl = _mm256_shuffle_epi8(lookup, l);
reg ph = _mm256_shuffle_epi8(lookup, h);
s = _mm256_add_epi8(s, pl);
s = _mm256_add_epi8(s, ph);
}
t = _mm256_add_epi64(t, _mm256_sad_epu8(s, _mm256_setzero_si256()));
}
int res = hsum(t);
while (k < N)
res += __builtin_popcount(a[k++]);
return res;
}
This matches the explanation below the code: the loop stops every 15 vector iterations because the 8-bit counters can overflow.
Thanks for the great resource!
Hi, I think there is a bug in the AVX2 popcount implementation in the Shuffles and Popcount section.
Current code:
Here,
kindexesintelements, while one AVX2 register processes 8ints.The inner loop loads from:
so it processes 16 elements, but the outer loop only advances by 15:
k += 15This causes overlapping between chunks, e.g. element
15can be counted twice.I think the intended version should process 15 AVX2 registers per outer iteration, i.e.
15 * 8integers:This matches the explanation below the code: the loop stops every 15 vector iterations because the 8-bit counters can overflow.
Thanks for the great resource!