How to dump part of binary file

linux bash terminal dump

14,707

Solution 1

In a single pipe:

xxd -c1 -p file |
  awk -v b="ffd8ffd0" -v e="aaffd9" '
    found == 1 {
      print $0
      str = str $0
      if (str == e) {found = 0; exit}
      if (length(str) == length(e)) str = substr(str, 3)}
    found == 0 {
      str = str $0
      if (str == b) {found = 1; print str; str = ""}
      if (length(str) == length(b)) str = substr(str, 3)}
    END{ exit found }' |
  xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file

The idea is to use awk between two xxd to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found and exit.

The case where the 1st pattern is found but the 2nd is not must be taken into account. It is done in the END part of the awk script, which return a non-zero exit status. This is catch by bash's ${PIPESTATUS[1]} where I decided to delete the new file.

Note that en empty file also mean that nothing has been found.

Solution 2

Locate the start/end position, then extract the range.

$ xxd -g0 input.bin | grep -im1 FFD8FFD0  | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin

Solution 3

This should work with standard tools (xxd, tr, grep, awk, dd). This correctly handles the "pattern split across line" issue, also look for the pattern only aligned at byte offset (not nibble).

file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex 
start=$((($(grep -bo "${startpattern}" ${file}.hex\
    | head -1 | awk -F: '{print $1}')-1)/3))
len=$((($(grep -bo "${endpattern}" ${file}.hex\
    | head -1 | awk -F: '{print $1}')-1)/3-${start}))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}

Note: The script above use a temporary file to prevent having the binary>hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two grep. A one-liner is also possible, at the expense of clarity.

One could also use tee and named pipe to prevent having to store a temporary file and converting output twice, but I'm not sure it would be faster (xxd is fast) and is certainly more complex to write.

Solution 4

See this link for a way to do binary grep. Once you have the start and end offset, you should be able with dd to get what you need.

Solution 5

Another solution in sed, but using less memory:

xxd -c1 -p file |
  sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' | 
  sed -n -e '1{N;N}' -e '/aa\nff\nd9/{p;Q1}' -e 'P;N;D' |
  xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file

The 1st sed prints from ff d8 ff d0 till the end of file. Note that you need as much N in -e '1{N;N;N}' as there is bytes in your 1st pattern less one.

The 2nd sed prints from the beginning of the file to aa ff d9. Note again that you need as much N in -e '1{N;N}' as there is bytes in your 2nd pattern less one.

Again, a test is needed to check if the 2nd pattern is found, and delete the file if it is not.

Note that the Q command is a GNU extension to sed. If you do not have it, you need to trash the rest of the file once the pattern is found (in a loop like the 1st sed, but not printing the file), and check after hex to binary conversion that the new_file end with the wright pattern.

View more solutions

14,707

Author by

theta

Updated on June 27, 2022

Comments

theta about 2 years

I have binary and want to extract part of it, starting from know byte string (i.e. FF D8 FF D0) and ending with known byte string (AF FF D9)

In the past I've used dd to cut part of binary file from beginning/ending but this command doesn't seem to support what I ask.

What tool on terminal can do this?
theta over 12 years

I found "..count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))..." to match exactly starting from "FFD8.." and ending to "AFFF..". Thank you for your nice procedure. Cheers
theta over 12 years

After couple of extractions I noticed that this is only approximate solution. +1, +2 all depend on content. For example 007d820: 74290068656c6c6f2e6a706700ffd8ff gives 007d820 for both '74 29 00 68' and '00 ff d8 ff' so something slightly different has to be done
Laurent Grégoire over 12 years

This does not work. If the pattern to match is split on two lines of xxd output it will never be found (by default xxd -g0 group lines per 16 bytes). For a pattern of 4 bytes long the probability to have a split is 25%. Also, the grep|awk will print the address of the beginning of the line where the pattern occur, so a delta of up to line size can happen, you end up with more data than you really want.
kev over 12 years

@lOranger use -c 160 option to reduce the probability.
Laurent Grégoire over 12 years

We're not talking about probability here, but certainty! Even with 160 (the max is 256 for xxd), the probability is more than 2%, which is huge. If you automate this, you need a script that works all the time, not 98% of the times. See my answer below for a proposal that works all the time.
theta over 12 years

lOranger, I used -c64 to compensate a bit, and cut and sed to calculate correct address, but -c1 should be real solution. I'll mark your solution, but when I manage to make it work. First I needed to change place of grep's pattern and filename to make grep work, but regardless I get dd: invalid number I imagine problem in start/len calculation/grammar. Also can't we exclude empty space and save 1/3 of output .hex file which would be double the input file size instead triple as it is now?
Laurent Grégoire over 12 years

Sorry, there was a typo in the script: grep pattern should be before the filename. I also added a | head -1 to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you have the "nibble" issue (pattern is not aligned on byte boundaries).
theta over 12 years

I'm afraid it still doesn't work. I get input file as result. I used my -c64 script, and get expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than provided, but still..)
Laurent Grégoire over 12 years

Please note that you have to convert your hex pattern to lowercase (or add option -i in grep). I've just tested the script here with a big binary file and it works fine. Please print the value of ${start} and ${len} to debug (you can check that start and len > 0 to prevent cases where the pattern is not found in the input.
theta over 12 years

Just in case: pastebin.com/raw.php?i=hZ5UqAF9 Patterns are in lower case. It simply returns the input file as dump, so start and end position are 0 and input file length.
Laurent Grégoire over 12 years

Well, I tested your script here and it works fine under a bash and sh script (provided I change the pattern to match some data in my input file). You have to check obviously that both patterns appears in the input. Which version of various tools are you using? Also please print ${start} and ${len} to check what's wrong. Please edit the .hex leftover file and manually check that the patterns are present, just in case...
theta over 12 years

Try it yourself with script from pastebin on this file: ge.tt/1EjaXGE/v/0 (160K)
Laurent Grégoire over 12 years

let us continue this discussion in chat
theta over 12 years

WOW, this is so sweet and looks so easy. Couldn't be better than this. I'll leave mark on IOranger's answer as it is correct and answered earlier, but this is by far my favourite snippet
jfg956 over 12 years

Too bad the quickest get the mark, not the shortest... Anyway, it can still be optimized by removing the tr, replacing it inside sed by -e '1h' -e '2,$H' -e '${x;s/\n/ /g}' and modifying the above substitution to be performed only on last line. Note that this solution does not work one huge binary files, as the file need to be put in memory in sed. On huge files, use the awk solution.
theta over 12 years

Yet another mark reassignment - lOranger' solution fails if 2nd pattern can be found before the 1st - giving $len with negative sign. This solution searches after the 1st pattern match, so it doesn't have such problem, nor generates intermediate triple size file.
theta over 12 years

Thanks. I tested this on 1GB laptop, and it was fine for 5MB file, but it made my system inaccessible on 50MB file. Is there maybe some general rule for determining "limit" file size based on available RAM, in your opinion?
jfg956 over 12 years

A 50MB file means 150MB once decoded and once bytes are separated by spaces. IT is not that much, but could cause sed to behave very slowly: a line of 150MB is a lot ! You could try the -n option to sed to remove buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know about sed implementation. The best is to do many tries. Sorry not to be able to help more.
theta over 12 years

Thanks. You helped more then enough
theta over 12 years

After testing this more, I found it without issues, but it's rather slow on larger files. Does anyone see a place for some optimisation, or this is the best one can get from xxd/awk?
jfg956 over 12 years

Try the new sed version that I just post. This one can be optimized replacing string concatenation and extraction with rotatory indexes in arrays, but it is less readable; and I do not want to do it if not needed ;-).
theta over 12 years

I do have this GNU extension to sed, but can't make this script work for some reason
jfg956 over 12 years

Sorry, typo in the 2nd sed: it should work if you replace /aa\nff\nd9/ with /af\nff\nd9/.
theta over 12 years

I don't understand what difference that would make? Please try this sample: ge.tt/42cScKE/v/0?c (160K)
jfg956 over 12 years

The link is not working :-(. If you do not have any output, it means that those 2 patterns are not found. You can debug the script running the 2 first commands and adding other after. About the change, I think you are looking for data between ff d8 ff d0 and af ff d9, but the script in my solution above is taking data between ff d8 ff d0 and aa ff d9.
theta over 12 years

Sorry, link must have expired. I uploaded on other service, please try here: hotfile.com/dl/148193223/e90ab68/bin.dat.html Patterns are of course present in file, I checked multiple times
jfg956 over 12 years

Ok, there was an error in the final test. I corrected it. The error was also in the awk version that I also corrected.
Floris almost 7 years

The three sets of wildcards make sed do a lot of recursive searching, probably... I think that may be the reason that things slow down when the file gets big.