Replacing null bytes with `sed` vs `tr`
Solution 1
From the manual page of tr(1):
SETs are specified as strings of characters ... Interpreted sequences are:
\NNN character with octal value NNN (1 to 3 octal digits)
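To see that rule in action, here is a minimal sketch (the variable names `short` and `long` are only for illustration). It feeds three NUL bytes to tr and writes the same octal escape with one digit and with three digits:

```shell
# Three NUL bytes from /dev/zero, each translated to 'S' by tr.
# '\0' and '\000' are the same octal escape (1 to 3 octal digits).
short=$(head -c 3 /dev/zero | tr '\0' 'S')
long=$(head -c 3 /dev/zero | tr '\000' 'S')
echo "$short"   # SSS
echo "$long"    # SSS
```

Both spellings give the same result, because tr reads the digits after the backslash as an octal character code.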
For sed(1), the manual page is not so clear, so a few experiments help:
echo -n hi |sed 's/h/t/g' |hexdump -c (0000000 t i)
Easy. Then:
echo -n hi |sed 's/h//g' |hexdump -c (0000000 i)
An empty replacement deletes the match. Again easy. Then:
echo -n hi |sed 's/h/\0/g' |hexdump -c (0000000 h i)
This \0 seems to do nothing. So try:
echo -n hi |sed 's/h/\00/g' |hexdump -c (0000000 h 0 i)
Oh! Could it be taking \0 as a reference to the matched part? That would also explain the previous example. The sed man page talks about \1 to \9, not \0 (but \0 has a meaning anyway, even in the pattern specification).
So, to cut it short: for sed, \0 has a special meaning which is not a NUL char. But it understands octal:
echo -n hi |sed 's/h/\o0/g' |hexdump -c (0000000 \0 i)
and hexadecimal:
echo -n hi |sed 's/h/\x0/g' |hexdump -c (0000000 \0 i)
As pointed out in the comments, tr and sed are different tools, designed differently. Yes, sed uses regexps while tr does not, but that is not the general explanation for why \0 is interpreted differently. In the messy world of unix there are, often, some conventions. In the messy world of unix there are, more often, exceptions to those conventions.
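The experiments above can be condensed into one sketch. It assumes GNU sed, since `\0`, `\o` and `\x` in the replacement are GNU extensions and other sed implementations may behave differently. The variable names and the use of `od` instead of `hexdump` are my own choices, made so the NUL byte survives command substitution:

```shell
# Assumes GNU sed; other sed implementations may treat these escapes differently.

# \0 in the replacement stands for the whole match, like &.
wrapped=$(printf 'hi' | sed 's/h/[\0]/g')
echo "$wrapped"    # [h]i

# \o0 and \x00 insert a real NUL byte. od renders the bytes as hex,
# because a shell variable cannot hold a NUL.
octal=$(printf 'hi' | sed 's/h/\o0/g' | od -An -tx1 | tr -d ' \n')
hex=$(printf 'hi' | sed 's/h/\x00/g' | od -An -tx1 | tr -d ' \n')
echo "$octal"      # 0069
echo "$hex"        # 0069
```

The `[h]i` result is the quickest way to convince yourself that `\0` is a back-reference to the match rather than a NUL character.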
Solution 2
The latter two commands in the question do work:
$ sed --version
sed (GNU sed) 4.4
Packaged by Cygwin (4.4-1)
$ echo -e "Hello\0World" | hexdump.exe -c
0000000 H e l l o \0 W o r l d \n
000000c
$ echo -e "Hello\0World" | sed 's/\x0/MyString/g'
HelloMyStringWorld
$ echo -e "Hello\0World" | sed 's/\x00/MyString/g'
HelloMyStringWorld
Octal sequences have to be prefixed by \o (thanks, Benjamin W., for this hint):
$ echo -e "Hello\0World" | sed 's/\o0/MyString/g'
HelloMyStringWorld
Thus, there must be another issue in the OP.
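For completeness, the idiom from the question can be checked the same way. This is a sketch assuming GNU sed, with `numrepeats` set to 3 purely as an example value:

```shell
# Assumes GNU sed, where \x00 in the regexp matches a NUL byte.
numrepeats=3
repeated=$(head -c "$numrepeats" /dev/zero | sed 's/\x00/MyString/g')
echo "$repeated"   # MyStringMyStringMyString
```

Each NUL byte is replaced by the full string, so the output contains `MyString` exactly `numrepeats` times.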
Solution 3
Specious question: there is no tr and sed per se. Rather, there are versions of these programs across time and OS platforms. Generally speaking, UNIX's history is a rapid florescence of variation; more specifically, tr was released for Version 4 Unix in 1973, while sed first appeared in Version 7 Unix in 1979. From the get-go, these were written by different authors, on different OSes, for different shells, with different purposes (note: Bash was written much later, in 1989, and is NOT the "owner" of either of these utilities). And things only get more varied and complex in terms of how these programs independently evolved, how they were maintained (again by different authors), how and which bugs were fixed, etc. While much effort has been made of late to standardize core utilities, assuming that sed and tr would treat characters in the exact same way is failing to grok the history, the troublesome lack of standards, as well as the strangely beneficial plurality of UNIX itself.
Admin
Updated on June 21, 2022
Comments
-
Admin almost 2 years
Bash newbie; using this idiom to generate repeats of a string:
echo $(head -c $numrepeats /dev/zero | tr '\0' 'S')
I decided I wanted to replace each null byte with more than one character (e.g. 'MyString' instead of just 'S'), so I tried the following with sed:
echo $(head -c $numrepeats /dev/zero | sed 's/\0/MyString/g' )
But I just get an empty output. I realized I have to do
echo $(head -c $numrepeats /dev/zero | sed 's/\x0/MyString/g' )
or
echo $(head -c $numrepeats /dev/zero | sed 's/\x00/MyString/g' )
instead, but I don't understand why. What is the difference between the characters that tr and sed match? Is it because sed is matching against a regex?
Edit: Interesting discovery that \0 in the replacement portion of the 's/regexp/replacement' sed command actually behaves the same as &. Still doesn't explain why \0 in regexp doesn't match the nullbyte though (as it does in tr and most other regex implementations)
-
Benjamin W. about 7 years
I guess it depends on the sed version. My GNU sed 4.3 doesn't understand \0, only \o0, \d0 or \x0.
-
Admin about 7 years
This answer just shows precisely my point: you have to specify the null byte in sed (with \x0 or \x00 or \o0 or whatever) differently than you do with tr. My question is why this is the case, if there is a proper answer to that at all. Or it may just be one of those Bash quirks which always catch beginners :/
-
Admin about 7 years
It makes sense that, in replacement, \0 refers, like &, to the matched portion of regexp (although, as you say, this is not explicitly stated in the manpages), i.e. it has a special meaning which is not the nullbyte, but only in the replacement portion. This still doesn't explain why \0 in the replacement part doesn't match the nullbyte (as it does in tr and in most other regex implementations).
-
Admin about 7 years
Also, pointing out that tr and sed have no obligation to behave the same doesn't answer the question either (yes, that much is obvious, I don't dispute the fact, I was just wondering if anyone knew of a coherent explanation or whether it's "just another Bash quirk")
-
Admin about 7 years
Nonetheless, thanks for uncovering the \0 as & behaviour in the replacement portion, TIL
-
linuxfan says Reinstate Monica about 7 years
@user141554 There is no bash quirk here (I verified): using single quotes makes all arguments plain and literal. tr uses the \xxx for octal notation (and lacks decimal and hex) while sed uses the \x to indicate a different thing - not characters. But it has octal, decimal and hex. I find it coherent enough, apart from the manual pages, which are often cryptic and imprecise. I think your question has got an answer about the replacement part of sed's substitute command. The \x plays a role even in the matching part of sed's substitute command, but that's another story.
-
Admin about 7 years
Hmm, I think if \0 in the regexp portion were similar to the special characters \1 to \9, then we should get something like sed: -e expression #1, char 9: Invalid back reference if we tried to do something like ls | sed 's/\0//g' for instance. Because that is what we get for ls | sed 's/\1//g'. But for \0 it just doesn't match and replace anything
-
Scheff's Cat about 7 years
@user141554 I guess it's actually not a "Bash quirk". The reg. expr. of sed is "secured" by single quotes. Thus, the bash may not be blamed for this. But I agree with you: I consider this as one of the less lucky design decisions although there might exist reasons for this. - There are a handful of really useful "standard" tools for text processing in Unix-likes. Each of them seems to have its own flavor of reg. expr. language. (Esp. when to use backslash for meta and when not drives me crazy...)
-
linuxfan says Reinstate Monica about 7 years
@user141554 I didn't find any docs about \0 in pattern; from a few tests, it seems that it is simply ignored. It makes some sense... if \1 .. \9 have a meaning, \0 has the same syntax but is not assigned so it is simply ignored/skipped. Perhaps a peek at the sources is the only way to clarify this.
-
linuxfan says Reinstate Monica about 7 years
@user141554 I took a peek at sed sources (there are quite a few versions, I looked only at one). While examining regexp for substitute command, there is explicit reference to \1 .. \9, in a switch(), but never to \0. I didn't go deeper ... I think it is simply ignored.
-
Admin about 7 years
Alright, I guess the question has more or less evolved into a "what does that \0 do in the regexp part of sed", and my curiosity has been pretty much satisfied. So thanks!