Replacing null bytes with `sed` vs `tr`

10,730

Solution 1

From the manual page of tr(1):

SETs are specified as strings of characters ... Interpreted sequences are:
\NNN character with octal value NNN (1 to 3 octal digits)
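
A minimal sketch of what that manual excerpt means in practice, assuming a POSIX-style `tr` and `printf`; the input strings are made up for illustration:

```shell
# tr reads \NNN as an octal escape, so '\0' (or '\000') names the NUL byte.
# printf '\000' emits a NUL; the sample input is hypothetical.
nul_to_s=$(printf 'a\000b\000c' | tr '\0' 'S')
echo "$nul_to_s"

# The same escape works for any byte: octal 101 is ASCII 'A'.
a_to_x=$(printf 'BANANA' | tr '\101' 'x')
echo "$a_to_x"
```

Because `tr` has no back-references, a backslash followed by digits can only mean an octal character code there.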

For sed(1), the manual page is not so clear, so a few experiments help:

echo -n hi |sed 's/h/t/g' |hexdump -c    (0000000   t   i)

Easy. Then:

echo -n hi |sed 's/h//g' |hexdump -c      (0000000   i)

An empty replacement deletes the match. Again easy. Then:

echo -n hi |sed 's/h/\0/g' |hexdump -c    (0000000   h   i)

This \0 seems to do nothing. So try

echo -n hi |sed 's/h/\00/g' |hexdump -c   (0000000   h   0   i)

Oh! Could it be taking \0 as a reference to the matched part? That would also explain the previous example. The sed man page talks about \1 to \9, not \0 (but \0 has a meaning anyway, even in the pattern specification).

So, to cut it short: for sed, \0 has a special meaning which is not a NUL char. But it understands octal:

echo -n hi |sed 's/h/\o0/g' |hexdump -c    (0000000  \0   i)

and hexadecimal:

echo -n hi |sed 's/h/\x0/g' |hexdump -c    (0000000  \0   i)
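
The experiments above can be condensed into one sketch. It assumes GNU sed (other implementations may treat these escapes differently), and the variable names are only illustrative:

```shell
# Assumes GNU sed. In the replacement, \0 behaves like &:
# both insert the text that the regexp matched.
with_backslash0=$(printf 'hi' | sed 's/h/[\0]/')
with_ampersand=$(printf 'hi' | sed 's/h/[&]/')
echo "$with_backslash0 $with_ampersand"

# \x00 (hex) and \o0 (octal) produce a real NUL byte instead.
# od prints the bytes as hex so the NUL is visible.
hex_bytes=$(printf 'hi' | sed 's/h/\x00/' | od -An -tx1 | tr -d ' \n')
oct_bytes=$(printf 'hi' | sed 's/h/\o0/' | od -An -tx1 | tr -d ' \n')
echo "$hex_bytes $oct_bytes"
```

Dumping the bytes with `od` rather than printing them avoids relying on how the terminal or the shell treats a NUL.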

As pointed out in the comments, tr and sed are different tools, designed differently. Yes, sed uses regular expressions while tr does not, but that is not the general explanation of why \0 is interpreted differently. In the messy world of Unix there are, often, some conventions. In the messy world of Unix there are, more often, exceptions to those conventions.

Solution 2

The latter two commands in the question do work:

$ sed --version
sed (GNU sed) 4.4
Packaged by Cygwin (4.4-1)

$ echo -e "Hello\0World" | hexdump.exe -c
0000000   H   e   l   l   o  \0   W   o   r   l   d  \n                
000000c

$ echo -e "Hello\0World" | sed 's/\x0/MyString/g'
HelloMyStringWorld

$ echo -e "Hello\0World" | sed 's/\x00/MyString/g'
HelloMyStringWorld

Octal sequences have to be prefixed by \o (thanks, Benjamin W., for this hint):

$ echo -e "Hello\0World" | sed 's/\o0/MyString/g'
HelloMyStringWorld

Thus, there must be another issue in the OP.
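
Applied to the repeat idiom from the question, the working form might look like the sketch below. It assumes GNU sed for the `\x00` escape; the value of `numrepeats` is a made-up example, and the second variant is only one possible way to avoid handing NUL bytes to sed at all:

```shell
# Hypothetical repeat count for illustration; the variable name is from the question.
numrepeats=3

# Assumes GNU sed, where \x00 in the regexp matches a NUL byte.
repeated=$(head -c "$numrepeats" /dev/zero | sed 's/\x00/MyString/g')
echo "$repeated"

# Variant: let tr map each NUL to a placeholder character first,
# then have sed expand the placeholder. sed does not rescan its own
# replacement text, so an 'S' inside "MyString" is not expanded again.
via_tr=$(head -c "$numrepeats" /dev/zero | tr '\0' 'S' | sed 's/S/MyString/g')
echo "$via_tr"
```

Quoting `"$numrepeats"` keeps the command well-formed even if the variable is empty or contains spaces.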

Solution 3

Specious question: there is no tr and sed per se. Rather, there are versions of these programs across time and OS platforms. Generally speaking, UNIX's history is a rapid florescence of variation; more specifically, tr was released for Version 4 Unix in 1973, while sed first appeared in Version 7 Unix in 1979. From the get-go, these were written by different authors, on different operating systems, for different shells, with different purposes (note: Bash was written much later, in 1989, and is NOT the "owner" of either of these utilities). And things only get more varied and complex in terms of how these programs independently evolved, how they were maintained (again by different authors), how and which bugs were fixed, etc. While much effort has been made of late to standardize core utilities, assuming that sed and tr would treat characters in exactly the same way is failing to grok the history, the troublesome lack of standards, as well as the strangely beneficial plurality of UNIX itself.

Author by

Admin

Updated on June 21, 2022

Comments

  • Admin
    Admin almost 2 years

    Bash newbie; using this idiom to generate repeats of a string:

    echo $(head -c $numrepeats /dev/zero | tr '\0' 'S')
    

    I decided I wanted to replace each null byte with more than one character (eg. 'MyString' instead of just 'S'), so I tried the following with sed

    echo $(head -c $numrepeats /dev/zero | sed 's/\0/MyString/g' )
    

    But I just get an empty output. I realized I have to do

    echo $(head -c $numrepeats /dev/zero | sed 's/\x0/MyString/g' )
    

    or

    echo $(head -c $numrepeats /dev/zero | sed 's/\x00/MyString/g' )
    

    instead, but I don't understand why. What is the difference between the characters that tr and sed match? Is it because sed is matching against a regex?

    Edit Interesting discovery that \0 in the replacement portion of the 's/regexp/replacement' sed command actually behaves the same as &. Still doesn't explain why \0 in regexp doesn't match the nullbyte though (as it does in tr and most other regex implementations)

  • Benjamin W.
    Benjamin W. about 7 years
    I guess it depends on the sed version. My GNU sed 4.3 doesn't understand \0, only \o0, \d0 or \x0.
  • Admin
    Admin about 7 years
    This answer just shows precisely my point: you have to specify the null byte in sed (with \x0 or \x00 or \o0 or whatever) differently than you do with tr. My question is why this is the case, if there is a proper answer to that at all. Or it may just be one of those Bash quirks which always catch beginners :/
  • Admin
    Admin about 7 years
    It makes sense that, in replacement, \0 refers, like &, to the matched portion of regexp (although, as you say, this is not explicitly stated in the manpages), i.e. it has a special meaning which is not the nullbyte, but only in the replacement portion. This still doesn't explain why \0 in the regexp part doesn't match the nullbyte (as it does in tr and in most other regex implementations).
  • Admin
    Admin about 7 years
    Also pointing out that tr and sed have no obligation to behave the same doesn't answer the question either (yes, that much is obvious, I don't dispute the fact, I was just wondering if anyone knew of a coherent explanation or whether it's "just another Bash quirk")
  • Admin
    Admin about 7 years
    Nonetheless, thanks for uncovering the \0 as & behaviour in the replacement portion, TIL
  • linuxfan says Reinstate Monica
    linuxfan says Reinstate Monica about 7 years
    @user141554 There is no bash quirk here (I verified): using single quotes makes all arguments plain and literal. tr uses the \xxx for octal notation (and lacks decimal and hex) while sed uses the \x to indicate a different thing, not characters. But it has octal, decimal and hex. I find it coherent enough, apart from the manual pages, which are often cryptic and imprecise. I think your question has got an answer about the replacement part of sed's substitute command. The \x plays a role even in the matching part of sed's substitute command, but that's another story.
  • Admin
    Admin about 7 years
    Hmm, I think if \0 in the regexp portion were similar to the special characters \1 to \9, then we should get something like sed: -e expression #1, char 9: Invalid back reference if we tried to do something like ls | sed 's/\0//g' for instance. Because that is what we get for ls | sed 's/\1//g'. But for \0 it just doesn't match and replace anything
  • Scheff's Cat
    Scheff's Cat about 7 years
    @user141554 I guess it's actually not a "Bash quirk". The reg. expr. of sed is "secured" by single quotes. Thus, the bash may not be blamed for this. But I agree with you: I consider this as one of the less lucky design decisions although there might exist reasons for this. - There are a handful of really useful "standard" tools for text processing in Unix-likes. Each of them seems to have its own flavor of reg. expr. language. (Esp. when to use backslash for meta and when not drives me crazy...)
  • linuxfan says Reinstate Monica
    linuxfan says Reinstate Monica about 7 years
    @user141554 I didn't find any docs about \0 in pattern; from a few tests, it seems that it is simply ignored. It makes some sense... if \1 .. \9 have a meaning, \0 has the same syntax but is not assigned so it is simply ignored/skipped. Perhaps a peek at the sources is the only way to clarify this.
  • linuxfan says Reinstate Monica
    linuxfan says Reinstate Monica about 7 years
    @user141554 I took a peek at sed sources (there are quite a few versions, I looked only at one). While examining regexp for substitute command, there is explicit reference to \1 .. \9, in a switch(), but never to \0. I didn't go deeper ... I think it is simply ignored.
  • Admin
    Admin about 7 years
    Alright, I guess the question has more or less evolved into a "what does that \0 do in the regexp part of sed", and my curiosity has been pretty much satisfied. So thanks!