How to dump part of binary file

Solution 1

In a single pipe:

xxd -c1 -p file |
  awk -v b="ffd8ffd0" -v e="afffd9" '
    found == 1 {
      print $0
      str = str $0
      if (str == e) {found = 0; exit}
      if (length(str) == length(e)) str = substr(str, 3)}
    found == 0 {
      str = str $0
      if (str == b) {found = 1; print str; str = ""}
      if (length(str) == length(b)) str = substr(str, 3)}
    END{ exit found }' |
  xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file

The idea is to put awk between two xxd calls to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.

The case where the 1st pattern is found but the 2nd is not must be taken into account. This is handled in the END block of the awk script, which returns a non-zero exit status. Bash's ${PIPESTATUS[1]} catches it, and I chose to delete the new file in that case.

Note that an empty file also means that nothing has been found.
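
A quick way to convince yourself that the pipeline behaves as described is to run it on a small synthetic file. The sketch below wraps the same awk script in a function; the name extract_between, its argument order, and the sample file names are illustrative assumptions, not part of the original answer. It assumes bash (for PIPESTATUS and printf '\xHH') and lowercase hex patterns without spaces.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the xxd | awk | xxd pipeline.
# usage: extract_between INPUT OUTPUT START_HEX END_HEX
extract_between() {
  local src=$1 dst=$2
  xxd -c1 -p "$src" |
    awk -v b="$3" -v e="$4" '
      found == 1 {
        print $0
        str = str $0
        if (str == e) {found = 0; exit}
        if (length(str) == length(e)) str = substr(str, 3)}
      found == 0 {
        str = str $0
        if (str == b) {found = 1; print str; str = ""}
        if (length(str) == length(b)) str = substr(str, 3)}
      END{ exit found }' |
    xxd -r -p > "$dst"
  # PIPESTATUS must be read right after the pipeline; index 1 is awk.
  if [ "${PIPESTATUS[1]}" -ne 0 ]; then rm -f "$dst"; return 1; fi
}

# Synthetic input: 2 junk bytes, start marker, 3 payload bytes,
# end marker, 1 trailing junk byte.
printf '\x10\x11\xff\xd8\xff\xd0\x01\x02\x03\xaf\xff\xd9\x22' > sample.bin
extract_between sample.bin sample.out ffd8ffd0 afffd9
xxd -p sample.out   # ffd8ffd0010203afffd9
```

If the end marker never appears, awk exits with status 1, the function removes the partial output, and it returns 1.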

Solution 2

Locate the start/end position, then extract the range.

$ xxd -g0 input.bin | grep -im1 FFD8FFD0  | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin

Solution 3

This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue, and it only matches the pattern when it is aligned on a byte boundary (not on a nibble).

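file=<yourfile>
outfile=<youroutputfile>
# Patterns must be lowercase (xxd prints lowercase hex), one space between bytes.
startpattern="ff d8 ff d0"
endpattern="af ff d9"
# One byte per line, then joined with spaces: each byte takes 3 characters.
xxd -g0 -c1 -ps "${file}" | tr '\n' ' ' > "${file}.hex"
# grep -b offsets are 0-based, so character offset / 3 is the byte offset.
start=$(($(grep -bo "${startpattern}" "${file}.hex" \
    | head -1 | awk -F: '{print $1}')/3))
# +3 keeps the 3-byte end pattern in the output.
len=$(($(grep -bo "${endpattern}" "${file}.hex" \
    | head -1 | awk -F: '{print $1}')/3-start+3))
dd ibs=1 count="${len}" skip="${start}" if="${file}" of="${outfile}"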

Note: The script above uses a temporary file to avoid running the binary-to-hex conversion twice. A space/time trade-off is to pipe the output of xxd directly into the two grep commands. A one-liner is also possible, at the expense of clarity.

One could also use tee and a named pipe to avoid both the temporary file and the double conversion, but I'm not sure it would be faster (xxd is fast), and it is certainly more complex to write.
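
As a rough sketch of the variant without a temporary file, the hex dump can be piped straight into grep, once per pattern. The function names hex_offset and extract_range and the sample file names are hypothetical. The sketch assumes bash and a grep that supports -b and -o (GNU grep does). It also searches for the end pattern only from the start match onwards, so an earlier occurrence of the end pattern cannot produce a negative length.

```bash
#!/usr/bin/env bash
# hex_offset "aa bb cc" < data
# Prints the 0-based byte offset of the first byte-aligned match.
# Prints nothing and returns 1 if the pattern is absent.
hex_offset() {
  local pos
  # Each byte becomes "xx " (3 characters), so char offset / 3 = byte offset.
  pos=$(xxd -c1 -p | tr '\n' ' ' | grep -F -bo -- "$1" | head -n1 | cut -d: -f1)
  [ -n "$pos" ] && echo $((pos / 3))
}

# extract_range INPUT OUTPUT "ff d8 ff d0" "af ff d9"
extract_range() {
  local start rel end_len
  start=$(hex_offset "$3" < "$1") || return 1
  # Look for the end pattern only in the data from the start match onwards.
  rel=$(tail -c +$((start + 1)) "$1" | hex_offset "$4") || return 1
  end_len=$(( (${#4} + 1) / 3 ))   # number of bytes in the end pattern
  dd ibs=1 skip="$start" count=$((rel + end_len)) if="$1" of="$2" 2>/dev/null
}

# The end marker also appears before the start marker in this sample.
printf '\xaf\xff\xd9\x10\xff\xd8\xff\xd0\x01\xaf\xff\xd9\x22' > sample3.bin
extract_range sample3.bin sample3.out "ff d8 ff d0" "af ff d9"
xxd -p sample3.out   # ffd8ffd001afffd9
```

The price is that xxd runs twice (once per pattern), which is the space/time trade-off mentioned in the note.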

Solution 4

See this link for a way to do a binary grep. Once you have the start and end offsets, you should be able to extract what you need with dd.
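
The linked page is not reproduced here, so the following is only one commonly used way to grep for raw bytes and may differ from what the link describes. It assumes GNU grep built with PCRE support (the -P option); LC_ALL=C is set so that \xHH escapes match single bytes. The function names bin_offset and bin_extract and the sample file names are hypothetical.

```bash
#!/usr/bin/env bash
# bin_offset PCRE < data
# Prints the 0-based byte offset of the first match (nothing if absent).
# Assumption: GNU grep with -P; -a treats binary as text, -b -o print
# the offset of each match.
bin_offset() {
  LC_ALL=C grep -obUaP -- "$1" | head -n1 | cut -d: -f1
}

# bin_extract INPUT OUTPUT START_PCRE END_PCRE END_LENGTH_IN_BYTES
bin_extract() {
  local start rel
  start=$(bin_offset "$3" < "$1")
  [ -n "$start" ] || return 1
  # Search for the end pattern from the start match onwards.
  rel=$(tail -c +$((start + 1)) "$1" | bin_offset "$4")
  [ -n "$rel" ] || return 1
  dd ibs=1 skip="$start" count=$((rel + $5)) if="$1" of="$2" 2>/dev/null
}

printf '\x10\x11\xff\xd8\xff\xd0\x01\x02\x03\xaf\xff\xd9\x22' > sample4.bin
bin_extract sample4.bin sample4.out '\xff\xd8\xff\xd0' '\xaf\xff\xd9' 3
xxd -p sample4.out   # ffd8ffd0010203afffd9
```

Because grep works line by line, a pattern that itself contains the byte 0a cannot be matched this way.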

Solution 5

Another solution in sed, but using less memory:

xxd -c1 -p file |
  sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' | 
  sed -n -e '1{N;N}' -e '/af\nff\nd9/{p;Q1}' -e 'P;N;D' |
  xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file

The 1st sed prints from ff d8 ff d0 to the end of the file. Note that -e '1{N;N;N}' needs as many N commands as there are bytes in your 1st pattern, minus one.

The 2nd sed prints from the beginning of its input up to and including af ff d9. Again, -e '1{N;N}' needs as many N commands as there are bytes in your 2nd pattern, minus one.

Again, a test is needed to check whether the 2nd pattern was found, and to delete the file if it was not.

Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the input once the pattern is found (in a loop like the one in the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
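
For the fallback without Q, the final check that the output ends with the right pattern could be sketched as follows. The helper name ends_with_hex and the demo file names are hypothetical; the pattern is given as lowercase hex without spaces.

```bash
#!/usr/bin/env bash
# ends_with_hex FILE HEX
# Succeeds if the last bytes of FILE equal HEX (e.g. afffd9).
ends_with_hex() {
  local n=$(( ${#2} / 2 ))   # two hex digits per byte
  # tr removes the line breaks xxd -p inserts every 30 bytes.
  [ "$(tail -c "$n" "$1" | xxd -p | tr -d '\n')" = "$2" ]
}

# Demo on two small files: one ends with af ff d9, the other is cut short.
printf '\x01\xaf\xff\xd9' > tail_ok.bin
printf '\x01\xaf\xff' > tail_bad.bin
ends_with_hex tail_ok.bin afffd9 && echo "tail_ok.bin: pattern found"
ends_with_hex tail_bad.bin afffd9 || echo "tail_bad.bin: pattern missing"
```

In the pipeline above this would replace the PIPESTATUS test, for example: ends_with_hex new_file afffd9 || rm new_file.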

Author by

theta

Updated on June 27, 2022

Comments

  • theta
    theta about 2 years

    I have a binary file and want to extract part of it, starting from a known byte string (e.g. FF D8 FF D0) and ending with a known byte string (AF FF D9).

    In the past I've used dd to cut part of a binary file from the beginning or the end, but this command doesn't seem to support what I'm asking for.

    What tool on terminal can do this?

  • theta
    theta over 12 years
    I found "..count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))..." to match exactly starting from "FFD8.." and ending to "AFFF..". Thank you for your nice procedure. Cheers
  • theta
    theta over 12 years
    After a couple of extractions I noticed that this is only an approximate solution. The +1, +2 corrections all depend on the content. For example, 007d820: 74290068656c6c6f2e6a706700ffd8ff gives 007d820 for both '74 29 00 68' and '00 ff d8 ff', so something slightly different has to be done.
  • Laurent Grégoire
    Laurent Grégoire over 12 years
    This does not work. If the pattern to match is split across two lines of xxd output, it will never be found (by default xxd -g0 prints 16 bytes per line). For a pattern 4 bytes long, the probability of a split is 25%. Also, grep|awk prints the address of the beginning of the line where the pattern occurs, so the offset can be off by up to one line's worth of bytes, and you end up with more data than you really want.
  • kev
    kev over 12 years
    @lOranger use -c 160 option to reduce the probability.
  • Laurent Grégoire
    Laurent Grégoire over 12 years
    We're not talking about probability here, but certainty! Even with 160 (the max is 256 for xxd), the probability is more than 2%, which is huge. If you automate this, you need a script that works all the time, not 98% of the time. See my answer below for a proposal that works all the time.
  • theta
    theta over 12 years
    lOranger, I used -c64 to compensate a bit, and cut and sed to calculate the correct address, but -c1 should be the real solution. I'll mark your solution, but only once I manage to make it work. First I needed to swap the positions of grep's pattern and filename to make grep work, but regardless I get "dd: invalid number", so I imagine the problem is in the start/len calculation or syntax. Also, can't we leave out the spaces and save 1/3 of the .hex output file, which would then be double the input file size instead of triple as it is now?
  • Laurent Grégoire
    Laurent Grégoire over 12 years
    Sorry, there was a typo in the script: grep pattern should be before the filename. I also added a | head -1 to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you have the "nibble" issue (pattern is not aligned on byte boundaries).
  • theta
    theta over 12 years
    I'm afraid it still doesn't work. I get input file as result. I used my -c64 script, and get expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than provided, but still..)
  • Laurent Grégoire
    Laurent Grégoire over 12 years
    Please note that you have to convert your hex pattern to lowercase (or add option -i in grep). I've just tested the script here with a big binary file and it works fine. Please print the values of ${start} and ${len} to debug (you can check that start and len are > 0 to catch cases where the pattern is not found in the input).
  • theta
    theta over 12 years
    Just in case: pastebin.com/raw.php?i=hZ5UqAF9 Patterns are in lower case. It simply returns the input file as dump, so start and end position are 0 and input file length.
  • Laurent Grégoire
    Laurent Grégoire over 12 years
    Well, I tested your script here and it works fine under both bash and sh (provided I change the pattern to match some data in my input file). You obviously have to check that both patterns appear in the input. Which versions of the various tools are you using? Also, please print ${start} and ${len} to check what's wrong. Please open the leftover .hex file and manually check that the patterns are present, just in case...
  • theta
    theta over 12 years
    Try it yourself with script from pastebin on this file: ge.tt/1EjaXGE/v/0 (160K)
  • Laurent Grégoire
    Laurent Grégoire over 12 years
  • theta
    theta over 12 years
    WOW, this is so sweet and looks so easy. Couldn't be better than this. I'll leave the mark on lOranger's answer as it is correct and was posted earlier, but this is by far my favourite snippet.
  • jfg956
    jfg956 over 12 years
    Too bad the quickest gets the mark, not the shortest... Anyway, it can still be optimized by removing the tr, replacing it inside sed with -e '1h' -e '2,$H' -e '${x;s/\n/ /g}' and modifying the above substitution to be performed only on the last line. Note that this solution does not work on huge binary files, as the whole file needs to be held in memory by sed. On huge files, use the awk solution.
  • theta
    theta over 12 years
    Yet another mark reassignment: lOranger's solution fails if the 2nd pattern can be found before the 1st, giving $len a negative sign. This solution searches after the 1st pattern match, so it doesn't have that problem, nor does it generate an intermediate file of triple size.
  • theta
    theta over 12 years
    Thanks. I tested this on 1GB laptop, and it was fine for 5MB file, but it made my system inaccessible on 50MB file. Is there maybe some general rule for determining "limit" file size based on available RAM, in your opinion?
  • jfg956
    jfg956 over 12 years
    A 50MB file means 150MB once it is converted to hex with bytes separated by spaces. It is not that much, but it could make sed very slow: a line of 150MB is a lot! You could try sed's -u option to remove buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know the sed implementation. The best is to experiment. Sorry not to be able to help more.
  • theta
    theta over 12 years
    Thanks. You helped more than enough.
  • theta
    theta over 12 years
    After testing this more, I found it without issues, but it's rather slow on larger files. Does anyone see a place for some optimisation, or this is the best one can get from xxd/awk?
  • jfg956
    jfg956 over 12 years
    Try the new sed version that I just posted. This one can be optimized by replacing string concatenation and extraction with rotating indexes into arrays, but it is less readable, and I do not want to do it if not needed ;-).
  • theta
    theta over 12 years
    I do have this GNU extension to sed, but can't make this script work for some reason
  • jfg956
    jfg956 over 12 years
    Sorry, typo in the 2nd sed: it should work if you replace /aa\nff\nd9/ with /af\nff\nd9/.
  • theta
    theta over 12 years
    I don't understand what difference that would make? Please try this sample: ge.tt/42cScKE/v/0?c (160K)
  • jfg956
    jfg956 over 12 years
    The link is not working :-(. If you do not get any output, it means that the 2 patterns were not found. You can debug the script by running the first 2 commands and then adding the others one at a time. About the change: I think you are looking for data between ff d8 ff d0 and af ff d9, but the script in my solution above was taking data between ff d8 ff d0 and aa ff d9.
  • theta
    theta over 12 years
    Sorry, link must have expired. I uploaded on other service, please try here: hotfile.com/dl/148193223/e90ab68/bin.dat.html Patterns are of course present in file, I checked multiple times
  • jfg956
    jfg956 over 12 years
    Ok, there was an error in the final test. I corrected it. The error was also in the awk version that I also corrected.
  • Floris
    Floris almost 7 years
    The three sets of wildcards make sed do a lot of recursive searching, probably... I think that may be the reason that things slow down when the file gets big.