How to count the number of set bits in a 32-bit integer?

algorithm binary bit-manipulation hammingweight iec10967

587,170

Solution 1

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.

Portable algorithms that don't need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set with a bithack in a loop until it becomes zero.

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

References:

How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn't possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into that top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.

With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
Counting 1 bits (population count) on large data using AVX-512 or AVX-2
related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)

Solution 2

Some languages portably expose the operation in a way that can use efficient hardware support if available, otherwise some library fallback that's hopefully decent.

For example (from a table by language):

C++ has std::bitset<>::count(), or C++20 std::popcount(T x)
Java has java.lang.Integer.bitCount() (also for Long or BigInteger)
C# has System.Numerics.BitOperations.PopCount()
Python has int.bit_count() (since 3.10)

Not all compilers / libraries actually manage to use HW support when it's available, though. (Notably MSVC, even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)

Also consider the built-in functions of your compiler when the portable language doesn't have this basic bit operation. In GNU C for example:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.

The GCC builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction when you compile with -mpopcnt or something that includes that (e.g. https://godbolt.org/z/Ma5e5a). Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar vintage AMD CPUs in use.

On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt (also implied by -msse4.2). See GCC x86 options. -march=nehalem -mtune=skylake (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.

Using `std::bitset<>::count()` instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and it's std::bitset<>::count always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, which in turn implies the popcnt feature.) (Update: see below; that does get MSVC's C++20 std::popcount to use x86 popcnt, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret

unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax        # unnecessary 64-bit operand size
    ret

unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

C++20 has `std::popcount(T)`

Current libstdc++ headers unfortunately define it with a special case if(x==0) return 0; at the start, which clang doesn't optimize away when compiling for x86:

#include <bit>
int bar(unsigned x) {
    return std::popcount(x);
}

clang 11.0.1 -O3 -std=gnu++20 -march=nehalem (https://godbolt.org/z/arMe5a)

# clang 11
    bar(unsigned int):                                # @bar(unsigned int)
        popcnt  eax, edi
        cmove   eax, edi         # redundant: if popcnt result is 0, return the original 0 instead of the popcnt-generated 0...
        ret

But GCC compiles nicely:

# gcc 10
        xor     eax, eax         # break false dependency on Intel SnB-family before Ice Lake.
        popcnt  eax, edi
        ret

Even MSVC does well with it, as long as you use -arch:AVX or later (and enable C++20 with -std:c++latest). https://godbolt.org/z/7K4Gef

int bar(unsigned int) PROC                                 ; bar, COMDAT
        popcnt  eax, ecx
        ret     0
int bar(unsigned int) ENDP                                 ; bar

Solution 3

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

Although these rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

Solution 4

From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in ~20-ish instructions (arch dependent), no branching.

Hacker's Delight is delightful! Highly recommended.

Solution 5

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

It works because you can count the total number of set bits by dividing in two halves, counting the number of set bits in both halves and then adding them up. Also know as Divide and Conquer paradigm. Let's get into detail..

v = v - ((v >> 1) & 0x55555555);

The number of bits in two bits can be 0b00, 0b01 or 0b10. Lets try to work this out on 2 bits..

 ---------------------------------------------
 |   v    |   (v >> 1) & 0b0101   |  v - x   |
 ---------------------------------------------
   0b00           0b00               0b00   
   0b01           0b00               0b01     
   0b10           0b01               0b01
   0b11           0b01               0b10

This is what was required: the last column shows the count of set bits in every two bit pair. If the two bit number is >= 2 (0b10) then and produces 0b01, else it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know—because of our previous operations—that every nibble has the count of set bits in it. Let's look an example. Suppose we have the byte 0b01000010. It means the first nibble has its 4bits set and the second one has its 2bits set. Now we add those nibbles together.

0b01000010 + 0b01000000

It gives us the count of set bits in a byte, in the first nibble 0b01100010 and therefore we mask the last four bytes of all the bytes in the number (discarding them).

0b01100010 & 0xF0 = 0b01100000

Now every byte has the count of set bits in it. We need to add them up all together. The trick is to multiply the result by 0b10101010 which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D B+C+D C+D D. A 4 byte number can have maximum of 32 bits set, which can be represented as 0b00100000.

All we need now is the first byte which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.

View more solutions

587,170

Author by

Matt Howells

I'm a software architect and senior developer, currently working in the investment banking industry.

Updated on July 08, 2022

Comments

Matt Howells almost 2 years
8 bits representing the number 7 look like this:
```
00000111
```
Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?
Matt Howells over 15 years

That'll work and is easy to understand, but there are faster methods.
Horcrux7 over 15 years

My code has 10 operation. Your code has 12 operation. Your link work with smaller arrays (5). I use 256 elements. With the caching can be a problem. But if you use it very frequently then this is not a problem.
Admin over 15 years

Unless you do this a LOT, the performance impact would be negligible. So all things being equal, I agree with daniel that 'best' implies "doesn't read like gibberish".
Matt Howells over 15 years

I deliberately didn't define 'best', to get a variety of methods. Lets face it if we have got down to the level of this sort of bit-twiddling we are probably looking for something uber-fast that looks like a chimp has typed it.
Admin over 15 years

I agree that this is good practice in general, but on XCode/OSX/Intel I found it to generate slower code than most of the suggestions posted here. See my answer for details.
indiv over 15 years

Instead of dividing by 2 and commenting it as "shift bits...", you should just use the shift operator (>>) and leave out the comment.
Admin over 15 years

This approach is measurably quite a bit faster than the bit-twiddling approach, as it turns out. As for using more memory, it compiles to less code and that gain is repeated every time you inline the function. So it could easily turn out to be a net win.
Mecki over 15 years

Bad code. A compiler might make good one out of it, but in my tests GCC did not. Replace (n%2) with (n&1); AND being much faster than MODULO. Replace (n/=2) with (n>>=1); bitshifting much faster than division.
Ankit Roy almost 15 years

+1. The first line in your NumberOfSetBits() is very cool -- just 3 instructions, instead of the 4 you would need if you separately masked out the even- and odd-numbered bits and added them (appropriately shifted) together.
Jason S over 14 years

ha! love the NumberOfSetBits() function, but good luck getting that through a code review. :-)
matja over 14 years

The Intel i5/i7 has the SSE4 instruction POPCNT which does it, using general purpose registers. GCC on my system does not emit that instruction using this intrinsic, i guess because of no -march=nehalem option yet.
Nils Pipenbrinck over 14 years

@matja, my GCC 4.4.1 emits the popcnt instruction if I compile with -msse4.2
Exectron over 14 years

Maybe it should use unsigned int, to easily show that it is free of any sign bit complications. Also would uint32_t be safer, as in, you get what you expect on all platforms?
Ponkadoodle about 14 years

wouldn't it make more sense to replace if ((value & 1) == 1) { count++; } with count += value & 1?
Maciej Hehl about 14 years

I checked a few things while answering to this question stackoverflow.com/questions/2709430/… and I think there is a bug. last line: return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; should be changed to: return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24; (first the sum and bitwise & later)
Matt Howells about 14 years

@Maciej: Can you please provide a case where it does not return the expected result?
Maciej Hehl about 14 years

@Matt Howells I'm sorry. It looks like I got lost among those parenthesis. I had a bug in my implementation. it didn't work for numbers > 15. I checked the wikipedia article and there were those parenthesis. I was sure that was my problem. I fixed the parenthesis and it started to work. Looks like I fixed something else. But I think I learned something. Inability to reproduce the "bug" made me check the precedence of the operators :). Thank You and I apologize for the mess
deft_code over 13 years

use c++'s std::bitset::count. after inlining this compiles to a single __builtin_popcount call.
NikiC over 13 years

No, the best solution isn't the one most readable in this case. Here the best algorithm is the fastest one.
paxdiablo over 13 years

That's entirely your opinion, @nikic, although you're free to downvote me, obviously. There was no mention in the question as to how to quantify "best", the words "performance" or "fast" can be seen nowhere. That's why I opted for readable.
NikiC over 13 years

@paxdiablo: Right as you are I changed my vote. I haven't read the question carefully enough. (I had to make a ghost-edit so I can change the vote. Hopefully you're okay with that.)
Matt Joiner over 13 years

The best non-assembler form like this I've seen unrolled 24 32bit words at a time. dalkescientific.com/writings/diary/popcnt.c, stackoverflow.com/questions/3693981/…, dalkescientific.com/writings/diary/archive/2008/07/05/…
benzado over 13 years

Not really an algorithm, this is just a library call. Useful for Java, not so much for everybody else.
Robert S. Barnes about 13 years

Is there a portable version of this code? For instance, how does it behave with 9 bit bytes or other unusual architectures?
Robert S. Barnes about 13 years

Not portable. What if the CPU has 9 bit bytes? Yes, there are real CPU's like that out there...
Saurabh almost 13 years

@Robert S. Barnes, this function will still work. It makes no assumption about native word size, and no reference to "bytes" at all.
Saurabh almost 13 years

@benzado is right but +1 anyway, because some Java developers might not be aware of the method
R.. GitHub STOP HELPING ICE almost 13 years

@nonnb: Actually, as written, the code is buggy and needs maintenance. >> is implementation-defined for negative values. The argument needs to be changed (or cast) to unsigned, and since the code is 32-bit-specific, it should probably be using uint32_t.
Nico over 12 years

I took the “write-only” bit, true as it was, as a challenge, and set to decode the code. I got about halfway; I'm not sure exactly what the additions and multiplication do in the context of this function. I've edited the answer to include the explanation as far as I could get it; I invite anyone smarter than me to finish it.
Matt Howells over 12 years

@Peter Hosey: Sorry, but I feel your comments in the code don't add much value and perhaps the explanation would be better in a comment.
Raymond Chenon over 12 years

after reading others, it's similar to paxdiablo's answer . I agree on "readability over cleverness any time".
Simon MᶜKenzie almost 12 years

Is there something special about this implementation? The accepted answer is obviously much more efficient than your answer, so how is this a "best" solution (as requested in the question)?
Admin over 11 years

I like very much your plug-in, polymorphic approach, as well as the switch to build as a reusable library or stand-alone, test executable. Very well thought =)
dash-tom-bang over 11 years

It's not really magic. It's adding sets of bits but doing so with some clever optimizations. The wikipedia link given in the answer does a good job of explaining what's going on but I'll go line by line. 1) Count up the number of bits in every pair of bits, putting that count in that pair of bits (you'll have 00, 01, or 10); the "clever" bit here is the subtract that avoids one mask. 2) Add pairs of those sums of bitpairs into their corresponding nibbles; nothing clever here but each nibble will now have a value 0-4. (cont'd)
dash-tom-bang over 11 years

3) does too much on one line, but up to the mask, the nibbles are added up into bytes, then the multiply adds all of the bytes into the high byte, which is then shifted down, leaving the result.
dash-tom-bang over 11 years

Another note, this extends to 64 and 128 bit registers by simply extending the constants appropriately. Interestingly (to me), those constants are also ~0 / 3, 5, 17, and 255; the former three being 2^n+1. This all makes more sense the more you stare at it and think about it in the shower. :)
greggo over 11 years

Another note, there's complaints above about >> extension being unspecified for negative ints, however, all of the results from >> are masked in such a way as to discard the extended bits .. except for the final >> which is OK since the upper 2 bits of its input are mathematically guaranteed to be zero. This may apply to the Jave concern as well.
underrun about 11 years

oh i like that. how bout the python version: def f(i, d={0:lambda:0, 1:lambda:1, 2:lambda:1, 3:lambda:2}): return d.get(i, lambda: f(i//4) + f(i%4))()
fIwJlxSzApHEZIl about 11 years

What if our input is a byte? I don't believe it's working for me when I cast my input from a byte to an int.
Vallabh Patade about 11 years

When sun provided different APIs, it must be using some logic on background, right?
Lefteris E almost 11 years

This algorithm is the version Matt Howells posted, before being optimized to the fact that it became unreadable.
Lefteris E almost 11 years

The solution by bcdabcd987 is the same algorithm before being optimized to the point of becoming illegible...
Lefteris E almost 11 years

mpi-inf.mpg.de/departments/rg1/teaching/advancedc-ws08/scrip‌t/…
nlucaroni almost 11 years

the intrinsic mentioned (_popcnt32/64) is located in immintrin.h and available if one has the POPCNT CPUID Feature Flag. It's not really part of SSE --at least that is my interpretation of the information in the Intrinsics Guide 3.0.1 provided by Intel)
Nils Pipenbrinck almost 11 years

@nlucaroni Well, yes. Times are changing. I've wrote this answer in 2008. Nowadays we have native popcount and the intrinsic will compile down to a single assembler statement if the platform allows that.
chux - Reinstate Monica over 10 years

Negative values of n always return 0.
chux - Reinstate Monica over 10 years

What is the c = about? Looks like is should be eliminated. Further, suggest an extra paren set A"(((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24" to avoid some classic warnings.
chux - Reinstate Monica over 10 years

An important feature is that this 32-bit routine works for both popcount(int v) and popcount(unsigned v). For portability, consider popcount(uint32_t v), etc. Really like the *0x1010101 part.
chux - Reinstate Monica over 10 years

In bitCount(), the for loop never terminates when n < 0.
waka-waka-waka over 10 years

I am reading this answer 3 years later, and I find it as the best answer because it is readable and has more comments. period.
neevek over 10 years

@finnw, i am one of those developers. :)
Jingguo Yao over 10 years

For a explanation of this algorithm, refer to page 179-180 of Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors.
Kevin Cox about 10 years

Note that this is a microbenchmark and should be taken with a grain of salt. For example, setting up the lookup table just before using it is priming the cache in a way that may be rare in real code.
Apriori almost 10 years

@R: greggo is correct. If ones are shifted in then they are masked away. The final shift follows a multiplication summing every byte into the high byte. Since the high byte is summing bits set, it will never exceed 32 and only requires at most 6 bits. So the high bit will never be set and the number will never be negative.
Nopik over 9 years

This solution seem to have minor problem, related to operator precedence. For each term it should say: x = (((x >> 1) & 0b01010101010101010101010101010101) + (x & 0b01010101010101010101010101010101)); (i.e. extra parens added).
tom over 9 years

Four years later and this is the one I'd use. Mainly because I can understand it enough to replicate it. Good answer.
Marco Bolis over 9 years

The Java method Integer.bitCount(int) uses this same exact implementation.
Marco Bolis over 9 years

As a side note, Java's implementation uses the same algorithm pointed out by Kevin Little.
keyser about 9 years

I find it weird that this answer isn't closer to this one in terms of upvotes. Performance-wise it seems really stable to depend on the number of bits set. Maybe I misunderstood just how fast this one can be.
Sambatyon about 9 years

Sorry but lookup tables are slow.
paxdiablo about 9 years

@Sambatyon, I'd take your contention more seriously if you provided actual support for it.
Jeremy Blum about 9 years

Having a little trouble following this - how would it change if we only cared about 16-bit values, instead of 32-bit?
Maarten Bodewes about 9 years

For Java you can just call Integer.bitCount(int) (since 1.5 but that's a while back already).
Maarten Bodewes about 9 years

Maybe hackers delight is delightful, but I would give a good kicking to anybody calling this pop instead of population_count (or pop_cnt if you must have an abreviation). @MarcoBolis I presume that will be true of all versions of Java, but officially that would be implementation dependent :)
v.oddou about 9 years

sauce ? (book, link, invetors's names etc) would be VERY welcomed. Because then we can paste that in our codebases with a comment to where it comes from.
imallett almost 9 years

Unfortunately, the bit counting isn't done in parallel, so it's probably slower. Might make a nice constexpr though.
pentaphobe over 8 years

Agreed - it was a fun exercise in C++ template recursion, but definitely a pretty naïve solution.
greybeard over 8 years

(cram set bits to the low end, invert, count unset bits from the low end)
Michael over 8 years

Unfortunately, call/return can be costly, so for inner loops I would prefer an inline version of Matt's NumberOfSetBits().
user924272 over 8 years

Ironically, that table could have been created by any of the algorithms posted in this thread! Nevertheless, using tables like this means constant-time performance. Going one step further and creating a 64K translation table would therefore halve the AND, SHIFT and ADD operations necessary. An interesting subject for bit manipulators!
David over 8 years

I like this method. It's not that hard to understand, there is only one like that really needs any comment. And when you understand that one line it is like a little gem that you have been given. (What was in my coffee this morning?) I have used this without the loop to check if only one single flag bit is set - i.e. process one way if only one flag bit is set, some other (more tedious) way if multiple flags are set. (edit: formatting)
emem about 8 years

I think for better clarity the last line should be written as: return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; so we don't need to count letters to see what you are actually doing (since you discarded the first 0, I accidentally thought you used the wrong (flipped) bit pattern as mask - that is until I noted there are only 7 letters and not 8).
Lewis Diamond almost 8 years

You can also generate a neat lookup table using variadic templates in C++. Still doesn't perform as well as the builtin though. bitbucket.org/ldiamond/popcount/src
C.J. almost 8 years

Much better than all the other answers. Clear, straightforward and comprehensible by humans without the aid of a debugger.
gexicide almost 8 years

Sounds like the javascript way. I would suggest using a webservice instead!
corsiKa almost 8 years

@JasonS If you cite where you get it from, for example Numerical Recipes or Art of Programming, and someone still denies it from code review as unmaintainable, that person shouldn't be reviewing code. There are some bibles/oracles we simply trust as sources of good algorithms so we don't go reinventing things ourselves.
Jason S almost 8 years

Even with citations, that doesn't mean the code has been copied correctly, or that the original code did not contain errors for edge cases.
Den Roman over 7 years

in case of big lookup tables it might really be slower than runtime computation because of CPU cache miss, but these ones do not look big
Jean-François Fabre about 7 years

(int) Math.pow(2, power) is that a joke?
greggo about 7 years

It's not hard to write a test suite that just goes through all 2**32 possible inputs and compares to a transparent reference implementation. Given that there are no loops or conditionals, no variable shifts (and thus no room for UD behavior, if you are using unsigned) that counts as proof. Not so much for the 64-bit version though. Incidentally, __builtin_popcount() (gcc,clang) will generate something very much like this when there is no popcount instruction.
greggo about 7 years

Readability? int cnt=0; for( int b=0; b < 32; b++) if( (val>>b) & 1) cnt++; But I think this question isn't that interesting until speed is a priority.
greggo about 7 years

Bigger tables can be slower (and not constant-time) due to cache issues. You can 'look up' 3 bits at a time with (0xe994 >>(k*2))&3,without memory access...
greggo about 7 years

Regarding optimization of %2 and /2: n%2 and n/2 are only equivalent to n&1 and n>>1 when n >=0. So, with unsigned it's fine (and the example is only really correct with unsigned, anyway). In this example, even with int, the compiler could prove that those ops are never reached unless n>=0, so it's not a reflection on the general case; if compiler can't prove n>=0, you will get something like (n -(n>>31))>>1 for the n/2, and similar weirdness for the n%2. In general case, if( (n%2)!=0) could optimize to (n&1)!=0 but if( (n%2)==1) cannot.
divillysausages almost 7 years

Implementation aside, this is probably the clearest message of intent for developers maintaining your code after you (or when you come back to it 6 months later)
Alex almost 7 years

And, this requires no multiplications, like the code in the accepted answer.
Peter Cordes over 6 years

@Michael: __builtin functions aren't real functions that get called. If compiling for a target that supports it as a single instruction (e.g. x86 with -mpopcnt), it inlines to just that. However, without that gcc may emit a call to a libgcc.a function like __popcountdi2 instead of inlining those instructions. It's up to the compiler, though; clang4.0 chooses to inline, just like it does for std::bitset.count(). godbolt.org/g/TueMQt
BeeOnRope over 6 years

Just to update the "state of popcnt instruction performance" mentioned in some of the above comments, the last several generations of Intel CPUs have been able to issue 1 popcnt per cycle, with a latency of 3 cycles, and AMD Zen architecture can issue 4 (!!) popcnt instructions per cycle with a latency of 1 cycle. So we can say that popcnt is "really fast" on modern hardware, and in AMD's case as fast as trivial instructions like or and add.
Peter Cordes over 6 years

Fun, but of no practical value. All BMI2 CPUs have popcnt. pext same,same to pack the bits could be an interesting building-block for something else, but tzcnt and pext both run on the same port as popcnt on Intel CPUs, and pext is very slow on AMD. (agner.org/optimize). You can sort of emulate pext x,x with (1ULL << popcnt(x)) - 1, except for the x==0 case. x86 shifts can't shift out all the bits, because they mask the shift count, and you have to watch out for C undefined behaviour with out of range counts.
Peter Cordes over 6 years

Buggy: godbolt.org/g/YuN5cA returns 131083 for n=1234667. You need to combine the two 8-bit chunks that the last step leaves, and clear the high bits. Also, it's apparently possible to be more efficient than this SWAR 1-bit, 2-bit, 4-bit sequence. I haven't grokked the magic of the top answer's bit-twiddling hack, though. stackoverflow.com/questions/109023/…
George Koehler over 6 years

That multiplication by 0x01010101 might be slow, depending on processor. For example, in my old PowerBook G4, 1 multiplication was about as slow as 4 additions (not as bad as division, where 1 division was about as slow as 23 additions).
ealfonso over 6 years

this is good for "sparse" numbers with a low number of bits, as it is O(ONE-BITS). It is indeed O(1) since there are at most 32 one-bits.
Bergi about 6 years

How is this different from this existing answer?
greybeard about 6 years

Both without a mention of S.E. Anderson's Bit Twiddling Hacks or B. Kernighan. Sigh.
Aiken Drum about 6 years

Downvote. Readability is NOT an issue in unifunctional leaf code that needs no maintenance. Performance IS.
paxdiablo about 6 years

@AikenDrum, you're entitled to your opinion but mine is that pretty much all code eventually needs maintenance of some description. I pray I never have to maintain yours :-)
Aiken Drum about 6 years

If you had to read any of my code that was written for perf over readability, you would find it well-commented to compensate. If your LoB involves maximum throughput, you don't sacrifice the bottom line and a lot of customers' time to save a single future programmer time figuring out (or reading) how your algorithm counted the number of set bits in a value. Have you not worked in more than one level of code, or more than one job? No one should think that such an exceedingly general rule of thumb can be applied in all cases. Learn some decorum before you go insulting people's skills on SE.
Jibb Smart about 6 years

I would suspect the bottleneck in the highly readable version would be the if branch, which would be unpredictable. Ponkadoodle's suggestion is almost as readable as the original, but removes the branch -- low hanging fruit that's still easy to read should be picked, IMHO. But anyway, I'm glad there's an answer that prioritises readability, even if that's not what I'd look for with a problem like this.
paxdiablo about 6 years

@AikenDrum, there is absolutely nothing in the question (or question history for that matter) that asks for performance. An earlier version asked for the best way (unwise since "best" was not defined) and my opinion has always been that readability trumps performance unless there are specific requirements. You should also learn to read smileys, that comment about maintaining your code was a gentle jab rather than an intended insult. My regret at doing so is tempered by the fact you appear to have retaliated in kind, implying a lack of experience on my part but I don't get offended by such.
paxdiablo about 6 years

I made it quite clear that the answer was heavily based on my opinions and preferences, and I saw little need to rehash the performance-based answers already in play. Earlier comments have alredy covered this ground and, apparently, 172 people agreed though, admittedly, that's four fifths of bugger-all of the SO community so may not carry that much weight :-) Bottom line, I have no issue with the way you vote, I just wanted to make sure you et al at least understood why I gave the answer I gave. You may have the last word if you wish, I think I've explained as best I can.
Clearer about 6 years

So, first do an expensive conversion and then do a set of expensive comparisons? Sounds like a very slow method.
Fabian almost 6 years

Note that the function bitCount() cannot be used for signed int because if the sign bit is set, the while loop does never end because in some implementations, the sign bit is added instead of zeroes.
paxdiablo almost 6 years

@Fabian, hence the unsigned in the function prototype :-)
user2297550 almost 6 years

i want to upvote this question SOLELY because it elicited the comment from @JohannesSchaub-litb but refraining since the passing neo-luddite would think i'm upvoting on merits.
user2297550 almost 6 years

even if i take pains to see your viewpoint, here is the problem: You state "the one that can be read by another programmer (or the original programmer two years later)" and then you write while (value > 0) .. and if ((value & 1) == 1) .. That implies that by "another programmer" you truly mean "another programmer of another language" or "original programmer who forgot C in the meanwhile" while tempting the casual reader to think you are advocating simpler algorithms. It is IDIOMATIC & easier to read while (value) .. and if (value & 1) .. Your true position is "code as would morons".
user2297550 almost 6 years

unsigned int count_bits (unsigned int val) { unsigned int n; for (n = 0; val; val >>= 1) n += val & 1; return n; } There .. was that so hard? And I stopped coding in C about 15 years ago, and it's easier to read. Going by your reputation the only rationale I can think of for your original answer is that you were trolling and establishing some kind of a moron honeypot to warn everybody else, lol.
Albert van der Horst about 5 years

Never use shift operators and masks on signed quantities.
Albert van der Horst about 5 years

Note that in generalizing to 64-bit there is a problem. The result cannot be 64, because of the mask.
Adrian Mole over 4 years

This is essentially Brian Kernighan's (remember him?) algorithm, with the minor change that he used the more succinct n &= (n-1) form.
Ronald Souza over 4 years

The 2 statements within the loop (namely, "if (n % 2) == 1" and "count += 1") could be replaced by a single, faster one: "count += (n % 2) == 1", leveraging the C++ support to implicit cast between bool and int. Interesting answer anyway. Upvoted.
S.S. Anne over 4 years

This only counts the number of initial set bits.
Doin over 4 years

If implementing this in Javascript, change the first line to i = (i|0) - ((i >> 1) & 0x55555555); The |0 coerces the value to an int32, giving a significant speed-up. Also using >>> over >> seems slightly faster (although as previous comments have pointed out, both will in fact work correctly in this algorithm). See performance tests here: jsperf.com/fast-int32-bit-count
Shree Harsha over 4 years

Don't use an axe to cut your nails when you have a nailclipper.
kevinarpe almost 4 years

I just noticed that C++-20 now has std::popcount() here: en.cppreference.com/w/cpp/numeric/popcount However, I prefer this answer because it also handles signed values. :)
Robur_131 almost 4 years

Is the complexity of this code O(floor(log2(num))/4), assuming num can be as arbitrarily large as possible? Because the while loop runs as long as there's a nibble to process? There are floor(log2(num)) bits and floor(log2(num)) / 4 nibbles. Is the reasoning correct?
BathNinja over 3 years

^^^ Overzelous programmers.
Thomas over 3 years

You might also see this talk by Matt Godbolt, which touches on exactly the same topic: youtu.be/bSkpMdDe4g4?t=2378
Peter Cordes over 3 years

@Thomas: Yup, I've seen it thanks. I left a comment on the video years ago (youtube.com/…), and linked it in prominently in How to remove "noise" from GCC/clang assembly output?. But yes, highly recommend it to future readers.
Peter Cordes about 3 years

@Alex: The first part of this is the same bithack as the accepted answer, it simply uses multiple two shift+add steps to horizontally sum 4 bytes into 1, instead of one multiply by a 0x01010101 constant to sum all 4 bytes into the high byte. (I would have done the x>>16 part first, to make the reduction work by narrowing the valuable part in half twice.) On a CPU without a fast multiplier, yes this is preferable (especially if it can do large shifts efficiently, not like 1 bit per cycle. Or if it has some other way to extract the high half; e.g. on a 16-bit CPU it would already be split
Peter Cordes about 3 years

@GeorgeKoehler: Multiply usually has higher latency than addition, but for throughput it's often still a single slot in the pipeline. So out-of-order exec can still overlap this with the surrounding code. But the alternative for summing 4 bytes into 1 is two shifts and 2 adds, as in the Hacker's Delight answer which is the same bithack as this and the accepted answer, up until the final reduction step.
Peter Cordes about 3 years

This would be clearer if the 0x0F0F0F0F was on a separate line, like v = v + (v >> 4) & 0xF0F0F0F; // sum nibbles to bytes. There's no benefit or clarity to jamming it onto the same line as the bytes -> int horizontal sum. It would also make the c = ... stray bit of code stand out more :P
Mark Ransom almost 3 years

@Robur_131 I don't see anything wrong with your reasoning, except that big-O doesn't care about constant factors so you could simplify to just O(log n). The nice thing about this algorithm is it doesn't always take worst case, if the upper bits are zero it exits early. In fact for an input of zero the loop doesn't run at all.
Peter Cordes over 2 years

@greggo: Even better, GCC10 and clang 10.0 both compile this to a popcnt instruction when targeting x86. So assuming their idiom recognizers for this are non-buggy, that's rock-solid proof of correctness.
Mark Ransom over 2 years

I was rather disappointed to find that Python's bit_count wasn't introduced until 3.10. I'm stuck at 3.8 so I can't use it.
Glenn Slayden over 2 years

In case you're confused, the error in the original article that @Nopik pointed out has since been fixed (by someone else), and without newly introducing extraneous parentheses as the comment suggests.
Peter Cordes almost 2 years

@LefterisE: Except using shift/add all the way to the end, instead of using a multiply to sum 8-bit chunks into the top 8 bits, replacing the last 2 x=... lines here. My edits demangled it some (keeping the optimized logic, improving readability and adding comments), and I added a section that explains the SWAR bithacks involved. This answer is still useful, though, at least for the diagram. Or for a hypothetical 32-bit machine with a slow multiply instruction.
Peter Cordes almost 2 years

@MarkRansom: So unless you're optimizing for inputs of zero, you should probably change it to a do{}while loop that doesn't compare/branch until after checking the low 4 bits. Or unroll to check 2x 4 bits on the first iteration. That means the branch-prediction pattern is the same for all inputs from 0..255, and making the branching more predictable is often a Good Thing, when the work saved is much cheaper than a branch mispredict. (Of course that depends on the CPU, and you wouldn't typically use this on a high-end CPU where branch misses are most expensive.)