Binary grep on Linux?

40,217

Solution 1

One-Liner Input

Here’s the shorter one-liner version:

% perl -ln0e 'print tell' < inputfile

And here's a slightly longer one-liner:

% perl -e '($/,$\) = ("\0","\n"); print tell while <STDIN>' < inputfile

The way to connect those two one-liners is by deparsing the first one’s program, which shows the code that the -l, -n, and -0 switches expand to:

% perl -MO=Deparse,-p -ln0e 'print tell'
BEGIN { $/ = "\000"; $\ = "\n"; }
LINE: while (defined(($_ = <ARGV>))) {
    chomp($_);
    print(tell);
}

Programmed Input

If you want to put that in a file instead of calling it from the command line, here’s a somewhat more explicit version:

#!/usr/bin/env perl

use English qw[ -no_match_vars ];

$RS  = "\0";    # input  separator for readline, chomp
$ORS = "\n";    # output separator for print

while (<STDIN>) {
    print tell();
}

And here’s the really long version:

#!/usr/bin/env perl

use strict;
use autodie;  # for perl5.10 or better
use warnings qw[ FATAL all  ];

use IO::Handle;

IO::Handle->input_record_separator("\0");
IO::Handle->output_record_separator("\n");

binmode(STDIN);   # just in case

while (my $null_terminated = readline(STDIN)) {
    # this is the offset just *past* the null we just read:
    my $seek_offset = tell(STDIN);
    print STDOUT $seek_offset;
}

close(STDIN);
close(STDOUT);

One-Liner Output

BTW, to create the test input file, I didn’t use your big, long Python script; I just used this simple Perl one-liner:

% perl -e 'print 0.0.0.0.2.4.6.8.0.1.3.0.5.20' > inputfile

You’ll find that Perl often winds up being 2-3 times shorter than Python to do the same job. And you don’t have to compromise on clarity; what could be simpler than the one-liner above?

Programmed Output

I know, I know. If you don’t already know the language, this might be clearer:

#!/usr/bin/env perl
@values = (
    0,  0,  0,  0,  2,
    4,  6,  8,  0,  1,
    3,  0,  5, 20,
);
print pack("C*", @values);

although this works, too:

print chr for @values;

as does

print map { chr } @values;

Although for those who like everything all rigorous and careful and all, this might be more what you would see:

#!/usr/bin/env perl

use strict;
use warnings qw[ FATAL all ];
use autodie;

binmode(STDOUT);

my @octet_list = (
    0,  0,  0,  0,  2,
    4,  6,  8,  0,  1,
    3,  0,  5, 20,
);

my $binary = pack("C*", @octet_list);
print STDOUT $binary;

close(STDOUT); 
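Whichever generator you pick, it is worth checking that the file holds exactly the intended octets. A minimal shell sketch, assuming a POSIX od. The helper name dump_bytes is made up for illustration, and printf stands in for the Perl generators so that the example is self-contained.

```shell
#!/bin/sh
# Hypothetical helper: print every byte of a file as an unsigned decimal number.
dump_bytes() {
    # -An: no address column, -v: do not collapse repeated lines,
    # -tu1: one-byte unsigned decimals.
    # The unquoted $(...) lets the shell squeeze od's padding into single spaces.
    echo $(od -An -v -tu1 "$1")
}

# Stand-in for the generators above: the same 14 octets, written as octal escapes.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '\000\000\000\000\002\004\006\010\000\001\003\000\005\024' > "$sample"

dump_bytes "$sample"
# prints: 0 0 0 0 2 4 6 8 0 1 3 0 5 20
```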

TMTOWTDI

Perl supports more than one way to do things so that you can pick the one that you’re most comfortable with. If this were something I planned to check in as a school or work project, I would certainly select the longer, more careful versions, or at least put a comment in the shell script if I were using the one-liners.

You can find documentation for Perl on your own system. Just type

% man perl
% man perlrun
% man perlvar
% man perlfunc

etc. at your shell prompt. If you want pretty-ish versions on the web instead, get the manpages for perl, perlrun, perlvar, and perlfunc from http://perldoc.perl.org.

Solution 2

This seems to work for me:

grep --only-matching --byte-offset --binary --text --perl-regexp "<\x-hex pattern>" <file>

Short form:

grep -obUaP "<\x-hex pattern>" <file>

Example:

grep -obUaP "\x01\x02" /bin/grep

Output (Cygwin binary):

153: <\x01\x02>
33210: <\x01\x02>
53453: <\x01\x02>

So you can grep this again to extract offsets. But don't forget to use binary mode again.
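The second pass that keeps only the offsets can be sketched as follows. This assumes GNU grep built with PCRE support. The helper name byte_offsets is hypothetical, and LC_ALL=C is an addition of mine so that bytes above 0x7F are treated as single bytes rather than as parts of multibyte characters.

```shell
#!/bin/sh
# Hypothetical helper: byte_offsets FILE PATTERN
# Prints the zero-based byte offset of every match of a PCRE byte pattern.
byte_offsets() {
    # First grep: -o only the match, -b byte offset, -U binary,
    # -a treat the file as text, -P Perl-compatible regex.
    # Second grep: again with -a, keep only the leading decimal offset of each line.
    LC_ALL=C grep -obUaP "$2" "$1" | LC_ALL=C grep -aoE '^[0-9]+'
}

# Sample data: the 14 bytes from the question, written as octal escapes.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '\000\000\000\000\002\004\006\010\000\001\003\000\005\024' > "$sample"

byte_offsets "$sample" '\x00'       # every single null byte: 0 1 2 3 8 11
byte_offsets "$sample" '\x00+'      # every run of null bytes: 0 8 11
byte_offsets "$sample" '\x01\x03'   # a two-byte signature: 9
```

One caveat: a match that itself contains a newline byte (0x0A) splits grep's output across lines, so in that case the second pass can pick up stray digits.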

Solution 3

Someone else appears to have been similarly frustrated and wrote their own tool to do it (or at least something similar): bgrep.

Solution 4

The bbe program is a sed-like editor for binary files. See documentation.

Example with bbe:

bbe -b "/\x00\x00\xCC\x00\x00\x00/:17" -s -e "F d" -e "p h" -e "A \n" mydata.bin

11:x00 x00 xcc x00 x00 x00 xcd x00 x00 x00 xce

Explanation

-b        search pattern between slashes; each byte is written as \x plus two hex digits.
          -b works like this: /pattern/:length (in bytes) after the matched pattern
-s        similar to 'grep -o': suppress unmatched output
-e        similar to 'sed -e': give commands
-e 'F d'  display the offset before each result (here: '11:')
-e 'p h'  print results in hexadecimal notation
-e 'A \n' append an end-of-line to each result

You can also pipe it to sed to have a cleaner output:

bbe -b "/\x00\x00\xCC\x00\x00\x00/:17" -s -e "F d" -e "p h" -e "A \n" mydata.bin | sed -e 's/x//g'

11:00 00 cc 00 00 00 cd 00 00 00 ce

Your solution with Perl from your EDIT3 gives me an 'Out of memory' error with large files, presumably because -0777 makes Perl slurp the whole file into memory.

bgrep has the same problem.

The only downside to bbe is that I don't know how to print context that precedes a matched pattern.
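One possible workaround for the missing leading context, sketched with GNU grep and od rather than bbe: take each match offset, step back a few bytes, and dump a fixed-size window from there. The helper name hex_context and its argument order are assumptions for illustration. Since grep handles a file without newline bytes as one long line, this may not help with the memory problem on very large files.

```shell
#!/bin/sh
# Hypothetical helper: hex_context FILE PATTERN BEFORE COUNT
# For each match of PATTERN, print "offset:" followed by COUNT bytes in hex,
# starting BEFORE bytes ahead of the match (clamped at the start of the file).
hex_context() {
    file=$1 pattern=$2 before=$3 count=$4
    LC_ALL=C grep -obUaP "$pattern" "$file" | LC_ALL=C grep -aoE '^[0-9]+' |
    while read -r offset; do
        start=$((offset - before))
        if [ "$start" -lt 0 ]; then start=0; fi
        # -j skips to the window start, -N limits the window length,
        # -tx1 prints one hex byte per field.
        echo "$offset:" $(od -An -v -tx1 -j "$start" -N "$count" "$file")
    done
}

# Sample data: the 14 bytes from the question, written as octal escapes.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '\000\000\000\000\002\004\006\010\000\001\003\000\005\024' > "$sample"

hex_context "$sample" '\x01\x03' 2 5
# prints: 9: 08 00 01 03 00
```

The printed offset is still that of the match itself; the window simply starts BEFORE bytes earlier.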

Solution 5

One way to solve your immediate problem using only grep is to create a file containing a single null byte. After that, grep -abo -f null_byte_file target_file will produce the following output.

0:
1:
2:
3:
8:
11:

That is, of course, each byte offset as requested by "-b", followed by a null byte as requested by "-o".

I'd be the first to advocate perl, but in this case there's no need to bring in the extended family.
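The null-byte pattern file can be created with printf. A short sketch, assuming GNU grep; the directory and file names are placeholders, and the second grep, which keeps only the offsets, is my addition.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
null_byte_file=$workdir/null_byte_file
target_file=$workdir/target_file

# The pattern file holds exactly one byte, 0x00, and no trailing newline.
printf '\000' > "$null_byte_file"

# Sample target: the 14 bytes from the question, written as octal escapes.
printf '\000\000\000\000\002\004\006\010\000\001\003\000\005\024' > "$target_file"

# -a treat binary as text, -b print byte offsets, -o print only the match,
# -f read the pattern from a file. The second grep strips the matched null bytes.
LC_ALL=C grep -abo -f "$null_byte_file" "$target_file" | LC_ALL=C grep -aoE '^[0-9]+'
```

The pattern goes through a file because a null byte cannot be passed as a command-line argument.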

Author by

sdaau

Updated on December 17, 2020

Comments

  • sdaau
    sdaau over 3 years

    Say I have generated the following binary file:

    # generate file:
    python -c 'import sys;[sys.stdout.write(chr(i)) for i in (0,0,0,0,2,4,6,8,0,1,3,0,5,20)]' > mydata.bin
    
    # get file size in bytes
    stat -c '%s' mydata.bin
    
    # 14
    

    And say, I want to find the locations of all zeroes (0x00), using a grep-like syntax.

     

    The best I can do so far is:

    $ hexdump -v -e "1/1 \" %02x\n\"" mydata.bin | grep -n '00'
    
    1: 00
    2: 00
    3: 00
    4: 00
    9: 00
    12: 00
    

    However, this implicitly converts each byte in the original binary file into a multi-byte ASCII representation, on which grep operates; not exactly the prime example of optimization :)

    Is there something like a binary grep for Linux? Possibly, also, something that would support a regular expression-like syntax, but also for byte "characters" - that is, I could write something like 'a(\x00*)b' and match 'zero or more' occurrences of byte 0 between bytes 'a' (97) and 'b' (98)?

    EDIT: The context is that I'm working on a driver, where I capture 8-bit data; something goes wrong in the data, which can be kilobytes up to megabytes, and I'd like to check for particular signatures and where they occur. (so far, I'm working with kilobyte snippets, so optimization is not that important - but if I start getting some errors in megabyte long captures, and I need to analyze those, my guess is I would like something more optimized :) . And especially, I'd like something where I can "grep" for a byte as a character - hexdump forces me to search strings per byte)

    EDIT2: same question, different forum :) grepping through a binary file for a sequence of bytes

    EDIT3: Thanks to the answer by @tchrist, here is also an example with 'grepping' and matching, and displaying results (although not quite the same question as OP):

    $ perl -ln0777e 'print unpack("H*",$1), "\n", pos() while /(.....\0\0\0\xCC\0\0\0.....)/gs' /path/to/myfile.bin
    
    ca000000cb000000cc000000cd000000ce     # Matched data (hex)
    66357                                  # Offset just past the match (dec)
    

    To have the matched data grouped one byte (two hex characters) at a time, "H2 H2 H2 ..." needs to be specified for as many bytes as there are in the matched string; as my match '.....\0\0\0\xCC\0\0\0.....' covers 17 bytes, I can write '"H2"x17' in Perl. Each of these "H2" returns a separate value (as in a list), so join also needs to be used to add spaces between them. Eventually:

    $ perl -ln0777e 'print join(" ", unpack("H2 "x17,$1)), "\n", pos() while /(.....\0\0\0\xCC\0\0\0.....)/gs' /path/to/myfile.bin
    
    ca 00 00 00 cb 00 00 00 cc 00 00 00 cd 00 00 00 ce
    66357
    

    Well.. indeed Perl is a very nice 'binary grepping' facility, I must admit :) As long as one learns the syntax properly :)