How to dump part of a binary file
Solution 1
In a single pipe:
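xxd -c1 -p file |
awk -v b="ffd8ffd0" -v e="aaffd9" '
    # After the start pattern: print every byte until the end pattern is seen.
    found == 1 {
        print $0
        str = str $0
        if (str == e) { found = 0; exit }
        if (length(str) == length(e)) str = substr(str, 3)
    }
    # Before the start pattern: slide a window over the last bytes read.
    found == 0 {
        str = str $0
        if (str == b) { found = 1; print str; str = "" }
        if (length(str) == length(b)) str = substr(str, 3)
    }
    # Non-zero exit status if the start was found but the end was not.
    END { exit found }' |
xxd -r -p > new_file
test "${PIPESTATUS[1]}" -eq 0 || rm new_file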
The idea is to use awk between two xxd calls to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.
The case where the 1st pattern is found but the 2nd is not must be taken into account. This is handled in the END part of the awk script, which returns a non-zero exit status. Bash's ${PIPESTATUS[1]} catches it, and I decided to delete the new file in that case.
Note that an empty file also means that nothing has been found.
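For shells without PIPESTATUS, one possible variation is to split the pipeline in two, so that awk is the last command and its exit status is simply $?. The sketch below wraps the same awk filter in a function; the name extract_between, the file names and the sample bytes are made up for illustration, and the price is a temporary hex file.

```shell
# extract_between FILE START_HEX END_HEX OUT   (hypothetical helper name)
# Patterns are lowercase hex without spaces, e.g. ffd8ffd0.
extract_between() {
    tmp=$(mktemp) || return 2
    # awk is the last command of this pipeline, so $? is its exit status.
    xxd -c1 -p "$1" |
    awk -v b="$2" -v e="$3" '
        found == 1 {
            print $0
            str = str $0
            if (str == e) { found = 0; exit }
            if (length(str) == length(e)) str = substr(str, 3)
        }
        found == 0 {
            str = str $0
            if (str == b) { found = 1; print str; str = "" }
            if (length(str) == length(b)) str = substr(str, 3)
        }
        END { exit found }' > "$tmp"
    status=$?
    # Keep the result only if both patterns were seen (status 0, non-empty output).
    if [ "$status" -eq 0 ] && [ -s "$tmp" ]; then
        xxd -r -p "$tmp" > "$4"
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Demo data only: 2 junk bytes, start pattern, 01 02 03, end pattern, 1 junk byte.
echo "0011ffd8ffd0010203aaffd999" | xxd -r -p > sample.bin
extract_between sample.bin ffd8ffd0 aaffd9 new_file && xxd -p new_file
# prints ffd8ffd0010203aaffd9
```

The function returns non-zero, and writes no output file, when either pattern is missing.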
Solution 2
Locate the start/end position, then extract the range.
$ xxd -g0 input.bin | grep -im1 FFD8FFD0 | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
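Keep in mind that the address printed by xxd is the offset of the first byte on the line, not of the match itself. As a rough sketch, a line address could be refined into a byte offset as shown below, assuming the whole pattern sits on one line; the sample line and the variable names are only illustrative.

```shell
# Sample xxd -g0 output line: address, then 16 bytes of hex without separators.
line='007d820: 74290068656c6c6f2e6a706700ffd8ff'
pattern='00ffd8ff'            # lowercase hex, as printed by xxd

addr=${line%%:*}              # offset of the first byte on the line
hex=${line#*: }               # the hex column...
hex=${hex%% *}                # ...without the ASCII column, if there is one
prefix=${hex%%"$pattern"*}    # hex digits that precede the first match

# Two hex digits per byte; an odd prefix length would mean the match
# starts in the middle of a byte (nibble-misaligned), so it is skipped.
if [ "$prefix" != "$hex" ] && [ $(( ${#prefix} % 2 )) -eq 0 ]; then
    offset=$(( 0x$addr + ${#prefix} / 2 ))
    echo "$offset"            # prints 514092 (0x7d820 + 12)
fi
```

This still misses patterns split across two xxd lines, so it only narrows the error rather than removing it.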
Solution 3
This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue, and it only looks for the pattern aligned on byte boundaries (not nibbles).
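file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
# One "hh " group (3 characters) per byte, all on a single line.
xxd -g0 -c1 -ps "${file}" | tr '\n' ' ' > "${file}.hex"
# grep -b reports 0-based offsets, so byte index = offset / 3.
startoff=$(grep -bo "${startpattern}" "${file}.hex" \
| head -n 1 | awk -F: '{print $1}')
endoff=$(grep -bo "${endpattern}" "${file}.hex" \
| head -n 1 | awk -F: '{print $1}')
start=$((startoff/3))
# Length up to, but not including, the end pattern.
len=$((endoff/3-start))
dd ibs=1 count="${len}" skip="${start}" if="${file}" of="${outfile}"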
Note: the script above uses a temporary file to avoid running the binary-to-hex conversion twice. A space/time trade-off is to pipe the output of xxd directly into each of the two grep commands. A one-liner is also possible, at the expense of clarity.
One could also use tee and a named pipe to avoid both the temporary file and the double conversion, but I'm not sure it would be faster (xxd is fast), and it is certainly more complex to write.
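The piped variant mentioned in the note could look roughly like the sketch below. It is not the script above, only one way to arrange it: xxd runs twice, there is no .hex file, and the second run uses xxd -s to start at the start pattern, so an end pattern that occurs earlier in the file is ignored (a case raised in the comments). The file names, the first_match helper and the choice to include the end pattern in the output are assumptions made for the example.

```shell
file=sample.bin              # hypothetical input file
outfile=part.bin             # hypothetical output file
startpattern="ff d8 ff d0"
endpattern="af ff d9"
endbytes=3                   # number of bytes in endpattern

# Demo data only: 2 junk bytes, start pattern, 01 02 03, end pattern, 1 junk byte.
echo "0011ffd8ffd0010203afffd999" | xxd -r -p > "$file"

# Print the character offset of the first match of pattern $1 in the
# space-separated hex text read from stdin (3 characters per byte).
first_match() {
    grep -bo "$1" | head -n 1 | cut -d: -f1
}

start_off=$(xxd -p -c1 "$file" | tr '\n' ' ' | first_match "$startpattern")
if [ -n "$start_off" ]; then
    start=$((start_off / 3))
    # -s skips to the start pattern, so only later end patterns can match.
    end_off=$(xxd -p -c1 -s "$start" "$file" | tr '\n' ' ' | first_match "$endpattern")
fi

if [ -n "$start_off" ] && [ -n "$end_off" ]; then
    len=$((end_off / 3 + endbytes))   # include the end pattern itself
    dd ibs=1 skip="$start" count="$len" if="$file" of="$outfile" 2>/dev/null
else
    echo "pattern not found" >&2
fi
```

With the demo data, part.bin contains ff d8 ff d0 01 02 03 af ff d9.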
Solution 4
See this link for a way to do binary grep. Once you have the start and end offsets, you should be able to get what you need with dd.
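One commonly used form of binary grep, which may or may not be what the linked page describes, relies on GNU grep with PCRE support (-P). The sketch below assumes such a grep is installed; the file names and sample bytes are invented for the example, and the end pattern is included in the output by choice.

```shell
file=sample.bin   # hypothetical input file
# Demo data only: 2 junk bytes, start pattern, 01 02 03, end pattern, 1 junk byte.
echo "0011ffd8ffd0010203afffd999" | xxd -r -p > "$file"

# -a: treat binary data as text, -b: print the byte offset,
# -o: report the match itself, -P: Perl-style regex (needed for \xHH).
# LC_ALL=C makes \xff match the single byte 0xff.
start=$(LC_ALL=C grep -aboP '\xff\xd8\xff\xd0' "$file" | head -n 1 | cut -d: -f1)
end=$(LC_ALL=C grep -aboP '\xaf\xff\xd9' "$file" | head -n 1 | cut -d: -f1)

if [ -n "$start" ] && [ -n "$end" ] && [ "$end" -ge "$start" ]; then
    # +3 so that the 3-byte end pattern is part of the output.
    dd ibs=1 skip="$start" count=$((end - start + 3)) if="$file" of=part.bin 2>/dev/null
else
    echo "patterns not found in the expected order" >&2
fi
```

This takes the first occurrence of each pattern, so it copies nothing when the first end pattern precedes the start pattern.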
Solution 5
Another solution in sed, but using less memory:
xxd -c1 -p file |
sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' |
sed -n -e '1{N;N}' -e '/aa\nff\nd9/{p;Q1}' -e 'P;N;D' |
xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file
The 1st sed prints from ff d8 ff d0 to the end of the file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, minus one.
The 2nd sed prints from the beginning of its input up to aa ff d9. Note again that you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, minus one.
Again, a test is needed to check whether the 2nd pattern is found, and to delete the file if it is not.
Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the input once the pattern is found (in a loop like the one in the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
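The final check on new_file could be done along these lines. This is only a sketch: the sample bytes written to new_file are demo data, and the variable names are illustrative.

```shell
pattern="aaffd9"                  # end pattern as lowercase hex, no spaces
nbytes=$(( ${#pattern} / 2 ))     # two hex digits per byte

# Demo data only: a file that does end with the pattern.
echo "ffd8ffd00102aaffd9" | xxd -r -p > new_file

# Compare the last bytes of the file, dumped as hex, with the expected pattern.
if [ "$(tail -c "$nbytes" new_file | xxd -p)" = "$pattern" ]; then
    echo "end pattern present"
else
    echo "end pattern missing, removing new_file" >&2
    rm -f new_file
fi
```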
theta
Updated on June 27, 2022
Comments
-
theta about 2 years
I have a binary file and want to extract part of it, starting from a known byte string (i.e. FF D8 FF D0) and ending with a known byte string (AF FF D9).
In the past I've used dd to cut part of a binary file from the beginning/end, but this command doesn't seem to support what I'm asking. What tool on the terminal can do this?
-
theta over 12 years
I found "..count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))..." to match exactly starting from "FFD8.." and ending to "AFFF..". Thank you for your nice procedure. Cheers
-
theta over 12 years
After a couple of extractions I noticed that this is only an approximate solution. +1, +2 all depend on content. For example
007d820: 74290068656c6c6f2e6a706700ffd8ff
gives 007d820 for both '74 29 00 68' and '00 ff d8 ff', so something slightly different has to be done
-
Laurent Grégoire over 12 years
This does not work. If the pattern to match is split across two lines of xxd output, it will never be found (by default xxd -g0 groups lines per 16 bytes). For a pattern 4 bytes long the probability of having a split is 25%. Also, the grep|awk will print the address of the beginning of the line where the pattern occurs, so a delta of up to the line size can happen and you end up with more data than you really want.
-
kev over 12 years
@lOranger use the -c 160 option to reduce the probability.
-
Laurent Grégoire over 12 years
We're not talking about probability here, but certainty! Even with 160 (the max is 256 for xxd), the probability is more than 2%, which is huge. If you automate this, you need a script that works all the time, not 98% of the time. See my answer below for a proposal that works all the time.
-
theta over 12 years
lOranger, I used -c64 to compensate a bit, and cut and sed to calculate the correct address, but -c1 should be the real solution. I'll mark your solution, but only once I manage to make it work. First I needed to swap grep's pattern and filename to make grep work, but regardless I get dd: invalid number
I imagine the problem is in the start/len calculation/grammar. Also, can't we exclude the spaces and save 1/3 of the output .hex file, which would then be double the input file size instead of triple as it is now?
-
Laurent Grégoire over 12 years
Sorry, there was a typo in the script: the grep pattern should be before the filename. I also added a | head -1 to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you have the "nibble" issue (pattern is not aligned on byte boundaries).
-
theta over 12 years
I'm afraid it still doesn't work. I get the input file as result. I used my -c64 script, and got the expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than provided, but still..)
-
Laurent Grégoire over 12 years
Please note that you have to convert your hex pattern to lowercase (or add option -i in grep). I've just tested the script here with a big binary file and it works fine. Please print the value of ${start} and ${len} to debug (you can check that start and len > 0 to prevent cases where the pattern is not found in the input).
-
theta over 12 years
Just in case: pastebin.com/raw.php?i=hZ5UqAF9 Patterns are in lower case. It simply returns the input file as dump, so start and end position are 0 and input file length.
-
Laurent Grégoire over 12 years
Well, I tested your script here and it works fine as both a bash and an sh script (provided I change the pattern to match some data in my input file). You obviously have to check that both patterns appear in the input. Which versions of the various tools are you using? Also please print ${start} and ${len} to check what's wrong. Please edit the leftover .hex file and manually check that the patterns are present, just in case...
-
theta over 12 years
Try it yourself with script from pastebin on this file: ge.tt/1EjaXGE/v/0 (160K)
-
Laurent Grégoire over 12 years
-
theta over 12 years
WOW, this is so sweet and looks so easy. Couldn't be better than this. I'll leave mark on IOranger's answer as it is correct and answered earlier, but this is by far my favourite snippet
-
jfg956 over 12 years
Too bad the quickest gets the mark, not the shortest... Anyway, it can still be optimized by removing the tr, replacing it inside sed by -e '1h' -e '2,$H' -e '${x;s/\n/ /g}' and modifying the above substitution to be performed only on the last line. Note that this solution does not work on huge binary files, as the file needs to be put in memory in sed. On huge files, use the awk solution.
-
theta over 12 years
Yet another mark reassignment - lOranger's solution fails if the 2nd pattern can be found before the 1st, giving $len with a negative sign. This solution searches after the 1st pattern match, so it doesn't have that problem, nor does it generate an intermediate triple-size file.
-
theta over 12 years
Thanks. I tested this on a 1GB laptop, and it was fine for a 5MB file, but it made my system inaccessible on a 50MB file. Is there maybe some general rule for determining the "limit" file size based on available RAM, in your opinion?
-
jfg956 over 12 years
A 50MB file means 150MB once decoded and once bytes are separated by spaces. It is not that much, but could cause sed to behave very slowly: a line of 150MB is a lot! You could try the -n option to sed to remove buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know about the sed implementation. The best is to do many tries. Sorry not to be able to help more.
-
theta over 12 years
Thanks. You helped more than enough
-
theta over 12 years
After testing this more, I found it without issues, but it's rather slow on larger files. Does anyone see a place for some optimisation, or is this the best one can get from xxd/awk?
-
jfg956 over 12 years
Try the new sed version that I just posted. This one can be optimized by replacing string concatenation and extraction with rotating indexes in arrays, but it is less readable; and I do not want to do it if not needed ;-).
-
theta over 12 years
I do have this GNU extension to sed, but can't make this script work for some reason
-
jfg956 over 12 years
Sorry, typo in the 2nd sed: it should work if you replace /aa\nff\nd9/ with /af\nff\nd9/.
-
theta over 12 years
I don't understand what difference that would make? Please try this sample: ge.tt/42cScKE/v/0?c (160K)
-
jfg956 over 12 years
The link is not working :-(. If you do not have any output, it means that those 2 patterns are not found. You can debug the script by running the first 2 commands and then adding the others one after another. About the change, I think you are looking for data between ff d8 ff d0 and af ff d9, but the script in my solution above is taking data between ff d8 ff d0 and aa ff d9.
-
theta over 12 years
Sorry, link must have expired. I uploaded on other service, please try here: hotfile.com/dl/148193223/e90ab68/bin.dat.html Patterns are of course present in file, I checked multiple times
-
jfg956 over 12 years
Ok, there was an error in the final test. I corrected it. The error was also in the awk version, which I also corrected.
-
Floris almost 7 years
The three sets of wildcards make sed do a lot of recursive searching, probably... I think that may be the reason that things slow down when the file gets big.