Thursday, October 20, 2011

population count (POPCNT)

I'm wondering, some people don't know about primitive bits technique. For example, trying to count number of '1' bits in int may be done as follows:
int get_nbits(int x) {
  int cnt = 0;
  while(x) {
    cnt += x & 1;
    x >>= 1;
  }
  return cnt;
}

It's rough and not efficient. Any ways to make it faster? Couple of them (and, yea - give more of 'em in comments).

boost 1)

Look here: we have a number which is a power of 2. Such numbers has a great attribute: if you perform '&' with number-1, you'll always get a zero. This is because you always have "ones" in place you get zero in power-of-two.

Look at this: number 16 and 16-1=15:
  10000
& 01111
-------
      0
So what does it means? You always eliminate the lower 'one' bit. And example above may be easily rewritten as:
int count_bits(int x) {
  int cnt = 0;
  while (x) {
    x &= (x-1);
    cnt++; 
  }
  return cnt;
}

So, now we will perform no more than "number-of-ones" loops. Ok, but still doesn't efficient.

boost 2)

Why we should count a bits using N loops; we have to precompute 'em instead! So, use precomputed table for bytes, we may rewrite our example as follows:
static unsigned char nbits[256] = {
  /*   0 -  15 */ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
  /*  16 -  31 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  32 -  47 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  48 -  63 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  64 -  79 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  80 -  95 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  96 - 111 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 112 - 127 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 128 - 143 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /* 144 - 159 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 160 - 175 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 176 - 191 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 192 - 207 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 208 - 223 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 224 - 239 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 240 - 255 */ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};

int count_bits(unsigned x)
{
  return nbits[x & 0xFFU] + nbits[(x >> 8) & 0xFFU] + nbits[(x >> 16) & 0xFFU] + nbits[x >> 24]
}

boost 3) Using POPCNT from Intel SSE4.2 command set

You may find brief description and links here.

boost 3.2

GCC has set of built-in commands which expands to CPU-instructions if present, or use software (but efficient) workarounds. As final example, I prefer to use this code:
//
// Uses SSE4 'POPCNT' instruction if present, or gcc-stub like 'popcount'
//
template int popcnt_modern(T);
template<> int popcnt_modern(unsigned x) { return __builtin_popcount(x); }
template<> int popcnt_modern(unsigned long x) { return __builtin_popcountl(x); }
template<> int popcnt_modern(unsigned long long x) { return __builtin_popcountll(x); }

Although It seems what gcc generates inefficient code unless your CPU provide SSE4.2. Hard to provide numbers here, but trust - code with byte-table is slightly faster.

No comments:

Post a Comment