How can I work with binary in bash, to copy bytes verbatim without any conversion?


Solution 1

Dealing with binary data at a low level in shell scripts is generally a bad idea.

bash variables can't contain the NUL byte (0x00). zsh is the only shell that can store that byte in its variables.

In any case, command arguments and environment variables cannot contain NUL bytes in any shell, as they are passed to the execve system call as NUL-terminated strings.
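
A minimal sketch of that limitation, assuming bash and a POSIX od. The sample data is made up for illustration:

```shell
# Three bytes go in, but bash discards the NUL when storing the output
# (recent bash versions also print a warning on stderr).
var=$(printf 'a\000b')
printf '%s\n' "${#var}"              # prints 2 in bash, not 3

# A pipe keeps every byte, because nothing is stored in a variable.
printf 'a\000b' | od -An -v -t x1    # shows the bytes 61 00 62
```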

Also note that:

var=`cmd`

or its modern form:

var=$(cmd)

strips all the trailing newline characters from the output of cmd. So, if that binary output ends in 0xa bytes, it will be mangled when stored in $var.
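
The stripping is easy to observe on made-up sample data. The sketch below also shows the common sentinel workaround, which rescues trailing newlines but, in bash, still loses NUL bytes:

```shell
var=$(printf 'data\n\n\n')
printf '%s\n' "${#var}"    # prints 4: the three trailing 0x0a bytes are gone

# Workaround: append a sentinel character, then strip it again.
var=$(printf 'data\n\n\n'; echo .)
var=${var%.}
printf '%s\n' "${#var}"    # prints 7: the newlines survived
```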

Here, you'd need to store the data encoded, for instance with xxd -p.

hdr_988=$(head -c 988 < "$inputFile" | xxd -p)
printf '%s\n' "$hdr_988" | xxd -p -r > "$output_hdr"

You could define helper functions like:

encode() {
  eval "$1"='$(
    shift
    "$@" | xxd -p  -c 0x7fffffff
    exit "${PIPESTATUS[0]}")'
}

decode() {
  printf %s "$1" | xxd -p -r
}

encode var cat /bin/ls &&
  decode "$var" | cmp - /bin/ls && echo OK
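
The helpers above depend on xxd. Where it is not installed, a rough alternative is to encode with od and decode with printf octal escapes. This is a different technique from the one above, and the function names od_encode and od_decode are hypothetical:

```shell
# Encode stdin as space-separated 3-digit octal byte values (e.g. "141 000 142").
od_encode() {
  od -An -v -t o1 | tr -s ' \n' ' '
}

# Decode a string produced by od_encode back to raw bytes on stdout.
od_decode() {
  # Unquoted $1 is split on whitespace into one word per byte;
  # printf "\141" then emits the byte with that octal value.
  for byte in $1; do
    printf "\\$byte"
  done
}

enc=$(printf 'a\000b\n\n' | od_encode)
od_decode "$enc" | od -An -v -t x1    # shows the bytes 61 00 62 0a 0a
```

The decode loop runs one printf per byte, so it is only reasonable for small amounts of data such as a 988-byte header.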

xxd -p output is not space-efficient, as it encodes each byte as 2 hexadecimal characters, but that makes it easy to manipulate (concatenating, extracting parts). base64 is more compact, encoding 3 bytes as 4 characters, but is not as easy to work with.
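
As an illustration of such manipulations: in a single-line hex string, byte offset n sits at character offset 2n, so parts can be cut out with ordinary text tools. In this sketch the sample data is made up, and od is used for the encoding only so that it runs where xxd is missing:

```shell
# Hex-encode 13 sample bytes into one line of 26 hex digits.
hex=$(printf 'HEADERpayload' | od -An -v -t x1 | tr -d ' \n')

# The first 6 bytes are hex characters 1-12; the rest starts at character 13.
first6=$(printf '%s' "$hex" | cut -c 1-12)
rest=$(printf '%s' "$hex" | cut -c 13-)
printf '%s\n' "$first6"    # prints 484541444552 ("HEADER")

# Concatenation is plain string concatenation.
swapped=$rest$first6
```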

The ksh93 shell has a builtin encoding format (uses base64) which you can use with its read and printf/print utilities:

typeset -b var # marked as "binary"/"base64-encoded"
IFS= read -rn 988 var < input
printf %B var > output

Now, if the data doesn't transit via shell or environment variables or command arguments, you should be OK as long as the utilities you use can handle any byte value. But note that most non-GNU implementations of text utilities can't handle NUL bytes, and you'll want to fix the locale to C to avoid problems with multi-byte characters. A last character that is not a newline can also cause problems, as can very long lines (sequences of bytes between two 0xa bytes that are longer than LINE_MAX).
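
For instance, fixing the locale for a single text utility could look like the sketch below, where the input bytes are made up. 0xe9 on its own is not valid UTF-8, but under LC_ALL=C it is just another single-byte character:

```shell
# Delete the byte "a" and pass everything else through untouched.
printf 'a\351b\n' | LC_ALL=C tr -d 'a' | od -An -v -t x1    # shows the bytes e9 62 0a
```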

head -c where it's available should be OK here, as it's meant to work with bytes, and has no reason to treat the data as text. So

head -c 988 < input > output

should be OK. In practice, at least the GNU, FreeBSD and ksh93 builtin implementations are OK. POSIX doesn't specify the -c option, but says head should support lines of any length (not limited to LINE_MAX).

With zsh:

IFS= read -rk988 -u0 var < input &&
print -rn -- $var > output

Or:

var=$(head -c 988 < input && echo .) && var=${var%.}
print -rn -- $var > output

Even in zsh, if $var contains NUL bytes, you can pass it as an argument to zsh builtins (like print above) or functions, but not as an argument to executables: arguments passed to executables are NUL-terminated strings. That's a kernel limitation, independent of the shell.

Solution 2

I am ambitiously trying to translate a c++ code into bash for a myriad of reasons.

Well yes. But maybe you should consider a very important reason for NOT doing it. Basically, "bash" / "sh" / "csh" / "ksh" and the like are not designed for processing binary data, and neither are most of the standard UNIX / LINUX utilities.

You would be better off either sticking with C++, or using a scripting language like Python, Ruby or Perl that is capable of dealing with binary data.

Is there a better way to do this in bash?

The better way is to not do it in bash.

Solution 3

From your question:

copy the first 988 lines of the header

If you are copying 988 lines, then it seems like a text file, not binary. However, your code seems to assume 988 bytes, not 988 lines, so I'll assume bytes is correct.

hdr_988=`head -c 988 ${inputFile}`
echo -n "${hdr_988}" > ${output_hdr}

This part may not work. For one thing, any NUL bytes in the stream will get stripped, because you use ${hdr_988} as a command line argument, and command line arguments cannot contain NUL. The backticks might be doing whitespace munging as well (I'm not sure about that). (Actually, since echo is a built-in, the NUL restriction might not apply, but I would say it's still iffy.)
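
One way to see the damage is to run the same approach on a tiny made-up file and compare sizes. This is a sketch assuming bash, with printf '%s' standing in for echo -n (the loss happens in the variable, not in the output command):

```shell
tmp=$(mktemp -d) || exit 1
printf 'ab\000cd\n\n' > "$tmp/in"    # 7 bytes: a NUL in the middle, two trailing newlines

hdr=$(head -c 7 "$tmp/in")           # bash drops the NUL and the trailing newlines
printf '%s' "$hdr" > "$tmp/out"

in_size=$(($(wc -c < "$tmp/in")))
out_size=$(($(wc -c < "$tmp/out")))
printf 'input: %s bytes, output: %s bytes\n' "$in_size" "$out_size"

# cmp reports the first differing byte, if any.
cmp "$tmp/in" "$tmp/out" || printf 'copy is not verbatim\n'
rm -rf "$tmp"
```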

Why not just write the header directly from the input file to the output file, without passing it through a shell variable?

head -c 988 "${inputFile}" >"${output_hdr}"

Or, more portably,

dd if="${inputFile}" of="${output_hdr}" bs=988 count=1
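
With bs=988 count=1, dd issues a single read, which can return fewer bytes when the input is a pipe or another non-regular file. A slower sketch that avoids this uses one-byte blocks, since a one-byte read cannot come up short. The function name copy_header is hypothetical:

```shell
# Copy exactly the first $3 bytes of file $1 to file $2, one byte per read.
copy_header() {
  dd if="$1" of="$2" bs=1 count="$3" 2>/dev/null
}

# The same idea on a pipe:
printf '0123456789' | dd bs=1 count=4 2>/dev/null    # writes 0123
```

For the header here, that would be called as copy_header "${inputFile}" "${output_hdr}" 988.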

Since you mention you are using bash, not POSIX shell, you have process substitution available to you, so how about this as a test?

cmp <(head -c 988 "${inputFile}") <(head -c 988 "${output_hdr}")
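
To be sure the method itself is sound, it can help to try it first on generated data that contains the awkward bytes. This is a sketch with made-up file names in a temporary directory; it pipes into cmp instead of using process substitution, so it also runs in plain sh:

```shell
tmp=$(mktemp -d) || exit 1
inputFile=$tmp/in
output_hdr=$tmp/hdr

# 1000 bytes of awkward data: NUL, newline, 0xff and a printable byte, repeated.
i=0
while [ "$i" -lt 250 ]; do
  printf '\000\n\377A'
  i=$((i + 1))
done > "$inputFile"

head -c 988 "$inputFile" > "$output_hdr"

# Check the size and the content separately.
size=$(($(wc -c < "$output_hdr")))
if [ "$size" -eq 988 ] && head -c 988 "$inputFile" | cmp -s - "$output_hdr"; then
  result=OK
else
  result=FAILED
fi
printf '%s\n' "$result"
rm -rf "$tmp"
```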

Finally: consider using $( ... ) instead of backticks.


Author: neurocoder

Updated on September 18, 2022

Comments

  • neurocoder over 1 year

    I am ambitiously trying to translate a c++ code into bash for a myriad of reasons.

    This code reads and manipulates a file type specific to my sub-field that is written and structured completely in binary. My first binary-related task is to copy the first 988 bytes of the header, exactly as-is, and put them into an output file that I can continue writing to as I generate the rest of the information.

    I am pretty sure that my current solution isn't working, and realistically I haven't figured out a good way to determine this. So even if it is actually written correctly, I need to know how I would test this to be sure!

    This is what I'm doing right now:

    hdr_988=`head -c 988 ${inputFile}`
    echo -n "${hdr_988}" > ${output_hdr}
    headInput=`head -c 988 ${inputTrack} | hexdump`
    headOutput=`head -c 988 ${output_hdr} | hexdump`
    if [ "${headInput}" != "${headOutput}" ]; then echo "output header was not written properly.  exiting.  please troubleshoot."; exit 1; fi
    

    If I use hexdump/xxd to check out this part of the file, although I can't exactly read most of it, something seems wrong. And the code I have written in for comparison only tells me if two strings are identical, not if they are copied the way I want them to be.

    Is there a better way to do this in bash? Can I simply copy/read binary bytes in native-binary, to copy to a file verbatim? (and ideally to store as variables as well).

    • DDPWNAGE about 8 years
      You can use dd to copy individual bytes (setting its count to 1). I'm not sure about storing them, though.
    • Ferrybig about 8 years
      Don't do bash in the C way, it will create many headaches. Instead use proper bash constructs
  • Freddy Lim about 8 years
    +1 for "The better way is to not do it in bash."
  • Stéphane Chazelas about 8 years
    Note that dd is not necessarily equivalent to head for non-regular files. head will do as many read(2) system calls as necessary to get those 988 bytes, while dd will just do one read(2). GNU dd has an iflag=fullblock to try and read that block in full, but that's then even less portable than head -c.
  • fpmurphy about 8 years
    Another reason not to go this route is that the resultant application will run significantly slower and consume more system resources.
  • Att Righ over 6 years
    Bash pipelines can act as a high level domain specific language of sorts that can increase understandability. There is nothing about a pipeline that is not binary, and there are various utilities implemented as command line tools that interact with binary data (ffmpeg, imagemagick, dd). Now if one is doing programming rather than glueing things together then using a full powered programming language is the way to go.
  • fpmurphy over 6 years
    zsh is not the only shell that can store one or more NUL bytes in a shell variable. ksh93 can do so also. Internally, ksh93 simply stores the binary variable as a base64-encoded string.
  • Stéphane Chazelas over 6 years
    @fpmurphy1, that's not what I call handling binary data, the variable doesn't contain the binary data, so you can't use any of the shell operators on them for instance, you can't pass them to builtins or functions in its decoded form... I'd call it rather builtin base64 encoding/decoding support.
  • Melab over 3 years
    How can zsh variables contain null bytes if environment variables cannot contain them?