int get_nbits(int x) {
  int cnt = 0;
  while(x) {
    cnt += x & 1;
    x >>= 1;
  }
  return cnt;
}
It's rough and not efficient. Any ways to make it faster? Couple of them (and, yea - give more of 'em in comments).
boost 1)
Look here: we have a number which is a power of 2. Such numbers has a great attribute: if you perform '&' with number-1, you'll always get a zero. This is because you always have "ones" in place you get zero in power-of-two. Look at this: number 16 and 16-1=15:
  10000
& 01111
-------
      0
So what does it means? You always eliminate the lower 'one' bit. And example above may be easily rewritten as:
int count_bits(int x) {
  int cnt = 0;
  while (x) {
    x &= (x-1);
    cnt++; 
  }
  return cnt;
}
So, now we will perform no more than "number-of-ones" loops. Ok, but still doesn't efficient.
boost 2)
Why we should count a bits using N loops; we have to precompute 'em instead! So, use precomputed table for bytes, we may rewrite our example as follows:
static unsigned char nbits[256] = {
  /*   0 -  15 */ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
  /*  16 -  31 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  32 -  47 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  48 -  63 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  64 -  79 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  80 -  95 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  96 - 111 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 112 - 127 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 128 - 143 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /* 144 - 159 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 160 - 175 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 176 - 191 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 192 - 207 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 208 - 223 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 224 - 239 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 240 - 255 */ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
int count_bits(unsigned x)
{
  return nbits[x & 0xFFU] + nbits[(x >> 8) & 0xFFU] + nbits[(x >> 16) & 0xFFU] + nbits[x >> 24]
}
boost 3) Using POPCNT from Intel SSE4.2 command set
You may find brief description and links here.boost 3.2
GCC has set of built-in commands which expands to CPU-instructions if present, or use software (but efficient) workarounds. As final example, I prefer to use this code:// // Uses SSE4 'POPCNT' instruction if present, or gcc-stub like 'popcount' // templateint popcnt_modern(T); template<> int popcnt_modern(unsigned x) { return __builtin_popcount(x); } template<> int popcnt_modern(unsigned long x) { return __builtin_popcountl(x); } template<> int popcnt_modern(unsigned long long x) { return __builtin_popcountll(x); } 
Although It seems what gcc generates inefficient code unless your CPU provide SSE4.2. Hard to provide numbers here, but trust - code with byte-table is slightly faster.
 
No comments:
Post a Comment