Fastest way to scan for bit pattern in a stream of bits

Solution 1

Using simple brute force is sometimes good.

I would precalculate all 16 shifted values of the word and put them in 16 ints, so you get an array like this (assuming int is at least twice as wide as short):

 unsigned short pattern = 1234;
 unsigned int preShifts[16];
 unsigned int masks[16];
 int i;
 for(i=0; i<16; i++)
 {
      preShifts[i] = (unsigned int)(pattern<<i);  //gets promoted to int
      masks[i] = (unsigned int) (0xffff<<i);
 }

Then, for every unsigned short you get out of the stream, make an unsigned int of that short and the previous short, and compare that unsigned int against the 16 precalculated values. If any of them match, you have found the pattern.

So basically like this:

  int numMatch(unsigned short curWord, unsigned short prevWord)
  {
       int numHits = 0;
       unsigned int combinedWords = ((unsigned int)prevWord << 16) | curWord;

       int i;
       for(i=0; i<16; i++)
       {
             if((combinedWords & masks[i]) == preShifts[i]) numHits++;
       }
       return numHits;
  }

Do note that this can mean multiple hits when the pattern is detected more than once on the same bits:

e.g. 32 bits of 0's, where the pattern you want to detect is 16 0's: then the pattern would be detected 16 times!


The time cost of this, assuming it compiles approximately as written, is 16 checks per input word. Per input bit, this does one AND, one compare, and one branch or other conditional increment, plus a table lookup of the mask for every bit position.

The table lookup is unnecessary: by right-shifting combined instead, we get significantly more efficient asm, as shown in another answer, which also shows how to vectorize this with SIMD on x86.
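
A minimal sketch of that shift-based variant (based on the description in the comments below, not the exact code of the linked answer; the function name and the extra pattern parameter are mine):

    int numMatchShift(unsigned short curWord, unsigned short prevWord, unsigned short pattern)
    {
        unsigned int combined = ((unsigned int)prevWord << 16) | curWord;
        int numHits = 0;
        int i;
        for (i = 0; i < 16; i++)
        {
            if ((combined & 0xFFFFu) == pattern) numHits++;  /* compare the low 16 bits against the pattern */
            combined >>= 1;                                  /* move on to the next bit offset */
        }
        return numHits;
    }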

Solution 2

Here is a trick to speed up the search by a factor of 32, if neither the Knuth-Morris-Pratt algorithm on the alphabet of two characters {0, 1} nor reinier's idea is fast enough.

You can first use a table with 256 entries to check, for each byte in your bit stream, whether it is contained in the 16-bit word you are looking for. You get the table with

unsigned char table[256];
for (int i=0; i<256; i++)
  table[i] = 0; // initialize with false
for (int i=0; i<8; i++)
  table[(word >> i) & 0xff] = 1; // mark contained bytes with true

You can then find possible positions for matches in the bit stream using

for (int i=0; i<length; i++) {
  if (table[bitstream[i]]) {
    // here comes the code which checks if there is really a match
  }
}

Since at most 8 of the 256 table entries are non-zero, on average you have to take a closer look only at every 32nd position. Only for that byte (combined with the byte before and the byte after) do you then have to use bit operations or the masking techniques suggested by reinier to see if there is a match.
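
A possible sketch of that verification step (my assumption of how it could look, reusing word, bitstream and length from above, and assuming the stream's first bit is the most significant bit of byte 0; adjust the shifts for the opposite bit order):

    /* Check all 16-bit windows contained in bytes i-1, i and i+1 around a
       candidate position i flagged by table[bitstream[i]]. */
    int matches_at(const unsigned char *bitstream, int i, unsigned short word)
    {
        unsigned int combined = ((unsigned int)bitstream[i - 1] << 16)
                              | ((unsigned int)bitstream[i]     <<  8)
                              |  (unsigned int)bitstream[i + 1];
        int count = 0;
        for (int shift = 0; shift <= 8; shift++)
            if (((combined >> shift) & 0xFFFF) == word)
                count++;
        return count;
    }

The ends of the buffer (i == 0 and i == length - 1) need separate handling, and adjacent candidate positions can report the same window twice, so a real implementation would have to deduplicate; both are omitted here.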

The code assumes that you use little-endian byte order. The order of the bits within a byte can also be an issue (well known to anyone who has implemented a CRC32 checksum).

Solution 3

I would like to suggest a solution using 3 lookup tables of size 256. This is efficient for large bit streams. This solution takes 3 bytes at a time for comparison. The following figure shows all possible arrangements of 16-bit data within 3 bytes. Each byte region is shown in a different color.

[figure: the 8 possible arrangements of a 16-bit pattern within 3 bytes — http://img70.imageshack.us/img70/8711/80541519.jpg]

Here bit positions 1 to 8 are handled in the first sample, 9 to 16 in the next sample, and so on. Now, when we search for a Pattern, we find all 8 possible arrangements (as below) of this Pattern and store them in 3 lookup tables (Left, Middle and Right).

Initializing Lookup Tables:

Let's take 0111011101110111 as an example Pattern to find. Now consider the 4th arrangement. The left part would be XXX01110. Fill all rows of the Left lookup table indexed by the left part (XXX01110) with 00010000. The 1 indicates the starting position of this arrangement of the input Pattern. Thus the following 8 rows of the Left lookup table would be filled with 16 (00010000).

00001110
00101110
01001110
01101110
10001110
10101110
11001110
11101110

The middle part of the arrangement would be 11101110. The row with this index (238) in the Middle lookup table will be filled with 16 (00010000).

Now the right part of the arrangement would be 111XXXXX. All rows (32 rows) with index 111XXXXX will be filled with 16 (00010000).

We should not overwrite elements in the lookup tables while filling; instead, do a bitwise OR to update an already-filled row. In the above example, all rows written by the 3rd arrangement would be updated by the 7th arrangement as follows.


Thus rows with index XX011101 in the Left lookup table, 11101110 in the Middle lookup table, and 111XXXXX in the Right lookup table will be updated to 00100010 by the 7th arrangement.

Searching Pattern:

Take a sample of three bytes. Compute Count as follows, where Left is the left lookup table, Middle is the middle lookup table and Right is the right lookup table:

Count = Left[Byte0] & Middle[Byte1] & Right[Byte2];

The number of 1s in Count gives the number of matches of the Pattern in the taken sample.

Here is some sample code, which has been tested.
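
The snippets below use several identifiers without declaring them; the following declarations are my assumption about the intended setup (the Pattern value is just the example from above):

    unsigned char Left[256]   = { 0 };   /* lookup table indexed by the left byte   */
    unsigned char Middle[256] = { 0 };   /* lookup table indexed by the middle byte */
    unsigned char Right[256]  = { 0 };   /* lookup table indexed by the right byte  */

    unsigned short Pattern = 0x7777;               /* example: 0111011101110111 */
    unsigned char  MSB = ( Pattern >> 8 ) & 0xFF;  /* high byte of the Pattern  */
    unsigned char  LSB = Pattern & 0xFF;           /* low byte of the Pattern   */

    unsigned char  Byte, Starting;                 /* Byte must be 8 bits wide so LSB << LeftShift truncates */
    unsigned int   RightShift, LeftShift, Index, Count, i;
    unsigned int   TotalCount = 0;

The tables must start out as all zeros, because the initialization only ORs bits into them.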

Initializing lookup table:

    for( RightShift = 0; RightShift < 8; RightShift++ )    /* one iteration per arrangement */
    {
        LeftShift = 8 - RightShift;

        Starting = 128 >> RightShift;   /* bit marking this arrangement in the table entries */

        Byte = MSB >> RightShift;       /* known low bits of the left byte */

        Count = 0xFF >> LeftShift;      /* highest value of the left byte's don't-care bits */

        for( i = 0; i <= Count; i++ )
        {
            Index = ( i << LeftShift ) | Byte;

            Left[Index] |= Starting;
        }

        Byte = LSB << LeftShift;        /* known high bits of the right byte (truncation to 8 bits is intended) */

        Count = 0xFF >> RightShift;     /* highest value of the right byte's don't-care bits */

        for( i = 0; i <= Count; i++ )
        {
            Index = i | Byte;

            Right[Index] |= Starting;
        }

        Index = ( unsigned char )(( Pattern >> RightShift ) & 0xFF );   /* the fully determined middle byte */

        Middle[Index] |= Starting;
    }

Searching Pattern:

Data is the stream buffer; Left, Middle and Right are the lookup tables from above.

for( int Index = 1; Index < ( StreamLength - 1); Index++ )
{
    Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];

    if( Count )
    {
        TotalCount += GetNumberOfOnes( Count );
    }
}

Limitation:

The above loop cannot detect a Pattern placed at the very end of the stream buffer. The following code needs to be added after the loop to overcome this limitation.

Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;

if( Count )
{
    TotalCount += GetNumberOfOnes( Count );
}
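
GetNumberOfOnes is not shown in the answer; a minimal portable version (assuming Count holds an 8-bit value) could be:

    unsigned int GetNumberOfOnes( unsigned char Value )
    {
        unsigned int Ones = 0;

        while( Value )
        {
            Value &= Value - 1;   /* clear the lowest set bit */

            Ones++;
        }

        return Ones;
    }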

Advantage:

This algorithm takes only N-1 logical steps to find a Pattern in an array of N bytes. The only overhead is filling the lookup tables initially, which is constant in all cases. So this will be very effective for searching huge byte streams.

Solution 4

I would implement a state machine with 16 states.

Each state represents how many received bits conform to the pattern. If the next received bit conforms to the next bit of the pattern, the machine steps to the next state. If it does not, the machine steps back to the first state (or to another state, if the beginning of the pattern can be matched with a smaller number of received bits).

When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.
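
A minimal sketch of such a matcher (my interpretation of this answer, not the author's code): the state is the number of pattern bits matched so far, and the fall-back states are precomputed with the classic KMP failure function over the pattern's bits. It assumes the stream is delivered byte by byte, most significant bit first.

    #define PATTERN_BITS 16

    static unsigned char pat[PATTERN_BITS];           /* pattern, one bit per entry, MSB first */
    static unsigned char fallback[PATTERN_BITS + 1];  /* state to fall back to on a mismatch   */

    static void build_tables( unsigned short pattern )
    {
        int i, k;
        for( i = 0; i < PATTERN_BITS; i++ )
            pat[i] = ( pattern >> ( PATTERN_BITS - 1 - i )) & 1;

        fallback[0] = 0;
        fallback[1] = 0;
        for( i = 1; i < PATTERN_BITS; i++ )
        {
            k = fallback[i];
            while( k && pat[k] != pat[i] )
                k = fallback[k];
            fallback[i + 1] = ( pat[k] == pat[i] ) ? k + 1 : 0;
        }
    }

    /* Feed one bit (0 or 1); returns 1 each time the full pattern has just been matched. */
    static int step( int *state, unsigned bit )
    {
        int s = *state;
        while( s && pat[s] != bit )
            s = fallback[s];
        if( pat[s] == bit )
            s++;
        if( s == PATTERN_BITS )
        {
            *state = fallback[s];   /* allow overlapping matches */
            return 1;
        }
        *state = s;
        return 0;
    }

Driving it over a byte buffer is then a double loop: for each byte, call step() once per bit, most significant bit first, e.g. step(&state, (byte >> (7 - b)) & 1).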

Solution 5

atomice's answer looked good until I considered Luke and MSalters' requests for more information about the particulars.

Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to […] for a particular case when the search pattern is 'AAAAAA'. For a multiple-pattern search, the […] might be most suitable.

You can find further introductory discussion here.


Comments

  • Admin
    Admin over 3 years

    I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this? There are various brute-force methods, using tables and/or shifts, but are there any "bit twiddling" shortcuts that can cut down the number of calculations by giving yes/no/maybe-contains results for each byte or word as it arrives?

    C code, intrinsics, x86 machine code would all be interesting.

  • Admin
    Admin over 14 years
    Emulating what an FPGA does in software may be an interesting place to start.
  • Admin
    Admin over 14 years
    This is more along the lines of what I was thinking and quite possibly as good as it will get, especially if SIMD intrinsics allow multiple comparisons across multiple words. The "flag" is non-zero and unique within the bit stream.
  • Toad
    Toad over 14 years
    using anything like classes at the bit level makes it terribly slow. Bitstreams will be no exception, especially if you are going to compare every bit separately.
  • Toad
    Toad over 14 years
    Cool. If you use SIMD you could potentially search even faster by putting the pattern a few times 'next to each other' in the huge registers. This way you can compare something like 4 unsigned ints at the same time.
  • ajs410
    ajs410 over 14 years
    All the FPGA would do is shift bits and compare, it's just that a circuit doesn't care about byte alignment, and it can shift the next bit in while it's comparing against the current bit.
  • Toad
    Toad over 14 years
    still... these all relate to text searches and nothing at the bit level. Turning the bit stream into a byte stream first might be very costly (memory/processor time) and impractical.
  • mouviciel
    mouviciel over 14 years
    This would be as fast as the input bit stream.
  • Ewan Todd
    Ewan Todd over 14 years
    Reinier, I'm concentrating on the string search algorithm here. You point out that the bitwise masking ops are not free. For now, I assume that they are comparably expensive for each algorithm. My main point, though, is that the specifics of the application may allow us to beat KMP.
  • Toad
    Toad over 14 years
    KMP can only test 1 letter at a time. With bits you can test 16 (or with SIMD 64) at a time. This would make KMP or any other letter-based algorithm useless.
  • Toad
    Toad over 14 years
    your accumulator needs 3 input bytes in the init ;^)
  • Toad
    Toad over 14 years
    clever way to speed things up!
  • csakii
    csakii over 14 years
    Probably not as fast as the versions that try to match multiple versions of the pattern at once, but plenty fast for any practical application.
  • csakii
    csakii over 14 years
    I wouldn't bet against this without testing it. With C++ template expansion, this might well get optimized into something surprisingly fast.
  • Ewan Todd
    Ewan Todd over 14 years
    I don't understand your insistence that KMP is restricted to byte sized ops. I have in mind a bitwise KMP, which shifts right by one bit on mismatch.
  • Ewan Todd
    Ewan Todd over 14 years
    Google "Adapting the Knuth–Morris–Pratt algorithm for pattern matching in Huffman encoded texts"
  • Michael Burr
    Michael Burr over 14 years
    I think the general idea is workable, but the details need some refinement. A pattern might span up to 3 bytes, so the 'match' table would have to contain items that can hold 24-bit values (probably unsigned ints); you'd need to represent all of the possible values in the 'don't care' bits, so you'd need 16*256=4096 items in your match table. A binary search might be in order. Finally you have to build the 3-byte value to look up on a byte-by-byte basis, not on a short-by-short basis. On the whole, I think that mouviciel's state machine approach would be simpler and possibly faster.
  • Toad
    Toad over 14 years
    michael: I don't think you understand the logic completely. The 16-bit pattern is shifted 16 times into 16 unsigned ints, creating every possible variation of the 16-bit word. By getting 16 bits at a time and masking this and checking it against the 16 possible patterns you have done all the checks you need.
  • Michael Burr
    Michael Burr over 14 years
    This would likely be pretty fast an have the other advantage of being dead simple so less chance for boundary condition bugs. The only tricky part is handling partial matches - you have to reset the state 'back somewhere in the stream'. The simple method of just backing up the data stream could cause things to be slow for pathological streams (or patterns) where you get a lot of 15 bit matches. It's possible to build into the state machine the right state to restart matching without having to re-check bits, but that would make the state machine trickier to build (I think).
  • Michael Burr
    Michael Burr over 14 years
    @reinier - I see - you're right. But the point about working byte-by-byte through the input stream stands (I think). Otherwise you'd miss the situation where a pattern spanned a short boundary in the input stream.
  • mouviciel
    mouviciel over 14 years
    This is tricky in the general case. But if the pattern is a constant of the problem, the state reset can be hardcoded. Or you can put the tricky part at initialisation step with a state-machine generation function taking the pattern as argument.
  • Toad
    Toad over 14 years
    @michael: No, it won't miss it, since I'm checking an unsigned int at a time. So it doesn't matter where the 16-bit pattern is shifted inside this 32-bit unsigned int.
  • Michael Burr
    Michael Burr over 14 years
    Another slightly tricky thing that needs to be handled (related to getting multiple matches) is to handle the 'remaining' bits in the stream after a match (assuming that you're interested in continuing the search for further matches). I think this just requires that the next matching loop start at the correct offset in the combinedWords/masks arrays (the one after the index where the current match was found, assuming you don't want to consider any of the bits in the current match for the next match). Not rocket science, but something that might not be obvious at first glance.
  • Michael Burr
    Michael Burr over 14 years
    @reinier - damn- I should go back to sleep.
  • Admin
    Admin over 14 years
    Nice ideas in here. I'm going to mull over the problem for a little while and perhaps try this and the one from Reinier.
  • caf
    caf over 14 years
    pattern should be declared 1234U, but other than that... this is pretty much what I would have done.
  • Luka Rahne
    Luka Rahne over 14 years
    why would you use Fourier if this can be solved in O(n) time?
  • Toad
    Toad over 14 years
    mark: my experience is: always spell it out to the compiler and don't trust its optimization magic too much when going for the optimal cycles. The human brain beats compilers anytime.
  • Toad
    Toad over 14 years
    Trying to digest your code (not sure if I get it completely) I see one potential problem: if a byte is located at more than 1 place in the pattern then your lookup table won't work, since it can only store one place in the pattern. So given the byte 00000000 and the pattern 1000000000000011 it should give 5 locations, but it can only give 1, resulting in a possible miss of the pattern. Or am I missing something?
  • Ryu
    Ryu over 14 years
    I'll represent bits in least to most significant order. match_offsets[0x00] = 4286594816. That is 00000000 11111100 00000001 11111111, the last 9 bits are not significant, the 6 set bits in the middle represent that the byte can match at pattern >> 1, pattern >> 2 .. pattern >> 6. (for reference least significant bit means that byte matches at pattern << 7 masking out 7 low bits, 8'th bit or bit7 means match at pattern << 0 or that the given byte equals the low byte of the pattern)
  • Toad
    Toad over 14 years
    great that solutions still keep coming in. By the way, your name sounds awesome. Could be the name of a character in a star wars movie ;^)
  • mouviciel
    mouviciel over 13 years
    @ziggystar: can you elaborate please?
  • ziggystar
    ziggystar over 13 years
    @mouviciel I'm sorry. I didn't read your answer properly. I've deleted my old wrong comment.
  • supercat
    supercat over 13 years
    Two optimizations I can see. First would be to check middle before looking up left and right, skipping the left and right checks if middle doesn't look good. A second optimization would be to use one 256x32-bit table rather than three 256x8 tables. This would replace three one-byte dereferences with one longword deference (the matching formula would be (byte2_lookup >> 16) & (byte1_lookup >> 8) & (byte0_lookup)). Note that if a lookup value is zero, one may be able to skip a byte provided one is prepared to backtrack upon finding a non-zero lookup value.
  • Ahmed A
    Ahmed A almost 10 years
    I see an issue with the above algorithm: a possible case of (false) double counting if there is a pattern match at a short boundary (alignment). E.g. if the pattern is xaabb and the stream is x1122aabb3344. The first call to numMatch(xaabb, x1122) will return 1. The second call to numMatch(x3344, xaabb) will return 1. For a total match count of 2.
  • Toad
    Toad almost 10 years
    Since we precalc the pattern by shifting 0 bits upto and including 15 bits, I don't see how we can match the pattern which is 16 bits further in the stream as you suggest.
  • Peter Cordes
    Peter Cordes about 5 years
    Some platforms have int = short both being 16 bit. uint32_t would be a more portable choice, or even uint_least32_t for even more portability (to platforms that don't have a type that's exactly 32 bits).
  • Peter Cordes
    Peter Cordes about 5 years
    Is "raw" supposed to be "row"? Terminology seems strange, but the basic idea of lookup tables to see if a byte is part of a match at some offset seems good. We might be able to adapt this for SIMD lookups in 4-bit chunks, e.g. SSSE3 pshufb to do 16 lookups in parallel from a vector of bytes, using 4-bit indices.
  • Peter Cordes
    Peter Cordes about 5 years
    There's no need to pre-calc shifted mask and pattern, instead do combined >>= 1 inside the bit-loop. And test (combined & 0xFFFF) == pattern. Right shifts or bitfield extracts are very cheap, and on architectures without enough registers to keep 16 shifted patterns in 16 different registers, some compilers actually load or compare with memory. godbolt.org/z/kKEnWs Compilers unroll the right shift loop to a bitfield-extract on ISAs like ARM (ubfx), but on x86 compilers can just do a 16-bit compare of the low 16-bits of a register. (cmp cx,dx).
  • Peter Cordes
    Peter Cordes about 5 years
    Posted an answer with a version that does the same work but without arrays, which compiles to more efficient asm on x86 and most other ISAs. And @hplbsh: my answer also includes an AVX2 version that does the brute-force check of all 16 bit offsets in 8 uops + loop overhead.
  • Jeremy
    Jeremy about 5 years
    And if space is tight the table can be reduced to a 32-byte bit array.
  • Toad
    Toad about 5 years
    Looks like a great speed up. +1 for actually checking the assembly code. I always hear claims that 'the compiler is smart and will do something clever', but I've seen in the past that that is not always the case. Good to see that the current compilers are actually quite clever.
  • Peter Cordes
    Peter Cordes about 5 years
    @Toad: Yeah, compilers usually make code that's not terrible, but the level of faith in compilers that some advocate for is ridiculous. They often need some hand-holding! And yes, clang optimizing away a couple arrays is on the more-clever end of things. But that doesn't happen for all architectures. For some like x86, they auto-vectorize the array init and then use it in a scalar loop. Even ICC didn't manage to auto-vectorize the search itself. (And ICC's scalar loop isn't remotely clever; it's not great except at auto-vectorizing). So the AVX2 version had to be manually vectorized.