How to handle CRLF line endings in grep?
Based on this page, try these solutions:
curl -sI http://unix.stackexchange.com | head -4 | grep "200 OK$(printf '\r')"
grep -IUlr $'\r'
Witiko
Updated on September 18, 2022
-
Witiko over 1 year
Suppose I have an arbitrary text input that contains CRLF line endings:
    $ curl -sI http://unix.stackexchange.com | head -4
    HTTP/1.1 200 OK
    Cache-Control: public, max-age=60
    Content-Length: 80551
    Content-Type: text/html; charset=utf-8
    $ curl -sI http://unix.stackexchange.com | head -4 | hexdump -C
    00000000  48 54 54 50 2f 31 2e 31  20 32 30 30 20 4f 4b 0d  |HTTP/1.1 200 OK.|
    00000010  0a 43 61 63 68 65 2d 43  6f 6e 74 72 6f 6c 3a 20  |.Cache-Control: |
    00000020  70 75 62 6c 69 63 2c 20  6d 61 78 2d 61 67 65 3d  |public, max-age=|
    00000030  36 30 0d 0a 43 6f 6e 74  65 6e 74 2d 4c 65 6e 67  |60..Content-Leng|
    00000040  74 68 3a 20 38 30 39 30  32 0d 0a 43 6f 6e 74 65  |th: 80902..Conte|
    00000050  6e 74 2d 54 79 70 65 3a  20 74 65 78 74 2f 68 74  |nt-Type: text/ht|
    00000060  6d 6c 3b 20 63 68 61 72  73 65 74 3d 75 74 66 2d  |ml; charset=utf-|
    00000070  38 0d 0a                                          |8..|
    00000073
GNU grep 2.26 does not handle such input very well with respect to line endings:
    $ curl -sI http://unix.stackexchange.com | head -4 | grep '200 OK$'
    $ curl -sI http://unix.stackexchange.com | head -4 | grep '200 OK.$'
    HTTP/1.1 200 OK
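The mismatch above can be reproduced without network access. This is a minimal sketch that assumes a locally generated status line (the `status` helper is hypothetical and stands in for the curl output) and a grep that leaves CRs in place, as GNU grep does outside MS-DOS and MS-Windows.

```shell
# Hypothetical stand-in for the first header line returned by curl.
status() { printf 'HTTP/1.1 200 OK\r\n'; }

# The CR sits between "OK" and the end of the line, so "$" does not match here.
# "|| true" keeps the script going when grep finds nothing and exits non-zero.
plain=$(status | grep -c '200 OK$' || true)

# "." consumes the CR, so this pattern matches.
dot=$(status | grep -c '200 OK.$' || true)

printf 'plain=%s dot=%s\n' "$plain" "$dot"
```

The counts come from `grep -c`, so the first should report no matching line and the second one matching line.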
This is a little annoying. I can of course resolve this by including dos2unix in the pipeline:
    $ curl -sI http://unix.stackexchange.com | head -4 | dos2unix | grep '200 OK$'
    HTTP/1.1 200 OK
but this feels a little hamfisted (and not very portable).
The weird thing in general is that the grep(1) man page claims that the tool will strip any CRs in the input, unless the input has been detected as binary:
    -U, --binary
        Treat the file(s) as binary. By default, under MS-DOS and MS-Windows, grep guesses whether a file is text or binary as described for the --binary-files option. If grep decides the file is a text file, it strips the CR characters from the original file contents (to make regular expressions with ^ and $ work correctly). Specifying -U overrules this guesswork, causing all files to be read and passed to the matching mechanism verbatim; if the file is a text file with CR/LF pairs at the end of each line, this will cause some regular expressions to fail. This option has no effect on platforms other than MS-DOS and MS-Windows.
EDIT: As stated in the manpage, this behaviour is MS-DOS and MS-Windows specific.
Is it possible to make grep transparently handle CRLF (and CR) line endings without preprocessing the input? If not, is this something that should be patched, or is there a well-founded rationale?
-
JdeBP over 7 years
Modifying a pipeline by adding a filter partway through should not feel hamfisted. It's a Unix norm. dos2unix may not be portable, but tr and sed are and can do the same filtering. It's a perl one-liner, too.
-
Witiko over 7 years
Sure, but it still adds complexity to the command, and this strikes me as a common enough problem to warrant direct support within grep.
-
Angel Todorov over 7 years
A sed equivalent of dos2unix is sed 's/\r$//'
-
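The tr and sed filters mentioned in these comments can be sketched as follows. The `sample` helper is a hypothetical stand-in for the curl output, and the CR is passed to sed as a literal byte, on the assumption that not every sed understands `\r` inside a regular expression.

```shell
# Hypothetical CRLF input, generated locally instead of fetched with curl.
sample() { printf 'HTTP/1.1 200 OK\r\nCache-Control: public\r\n'; }

# tr deletes every CR in the stream, not only the ones before a newline.
with_tr=$(sample | tr -d '\r' | grep '200 OK$')

# sed strips only a trailing CR. The CR is held in a variable and
# interpolated, so the sed script contains the byte itself.
cr=$(printf '\r')
with_sed=$(sample | sed "s/${cr}\$//" | grep '200 OK$')

printf '%s\n' "$with_tr" "$with_sed"
```

Both pipelines should print the status line, now without its trailing CR.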
Witiko over 7 years
… which is equivalent to using the pattern something\r\?$ instead of something$ directly within grep. Still, it is an annoyance and a level of detail I would expect grep to abstract away (if I ask nicely enough through some flag). Suppose you are grepping through a file that uses only \r to end lines (the way old Macs did). Then it becomes more than an annoyance, since grep will not recognize these as line endings and will buffer the entire file as a single line. Of course, sed 's/\r/\n/g' will fix this, but how anyone would think having to do this is a good idea baffles me.
-
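The optional-CR pattern from this comment can be tried on both kinds of input. This sketch uses `grep -E` with `?` rather than the `\?` form, and inserts a literal CR through a variable; the sample strings are made up for illustration.

```shell
# A literal CR for use inside the pattern.
cr=$(printf '\r')

# Same pattern against CRLF-terminated and LF-terminated input.
# "${cr}?" makes the CR optional, so "$" can anchor after either ending.
crlf_hits=$(printf 'something\r\nother\r\n' | grep -c -E "something${cr}?\$")
lf_hits=$(printf 'something\nother\n' | grep -c -E "something${cr}?\$")

printf 'crlf=%s lf=%s\n' "$crlf_hits" "$lf_hits"
```

Each count should be one, since only the first line of each sample contains the word. This does nothing for CR-only files, where grep still sees a single line.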
Admin almost 2 years
Plus, these filter solutions aren't suitable when I want to grep in multiple files in one go, like grep "pattern$" *.txt
-
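For the multi-file case raised here, one possible workaround is to put the optional CR in the pattern itself so that no filter is needed. The sketch below builds two hypothetical files (`dos.txt` and `unix.txt`) in a temporary directory; the file names and contents are assumptions for illustration only.

```shell
# Scratch directory with one CRLF file and one LF file (hypothetical names).
dir=$(mktemp -d)
printf 'pattern\r\n' > "$dir/dos.txt"
printf 'pattern\n' > "$dir/unix.txt"

cr=$(printf '\r')

# -l lists matching files; the optional CR covers both line-ending styles.
matches=$(cd "$dir" && grep -l -E "pattern${cr}?\$" *.txt)

# Listing only the files that contain a CR at all.
cr_files=$(cd "$dir" && grep -l "$cr" *.txt)

printf 'matching files:\n%s\nfiles with CR:\n%s\n' "$matches" "$cr_files"
rm -r "$dir"
```

The second grep call is a simplified variant of the `grep -IUlr $'\r'` command quoted at the top of the page, written without the bash-only `$'\r'` quoting.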
Witiko over 7 years
I can of course use $(printf '\r') (or $'\r' in bash) to insert a literal CR into the pattern. What I'm asking, however, is if there is a way for me to not have to do that. I'd like to match line endings transparently (i.e. regardless of whether they consist of a CR, LF, or CRLF).
-
Witiko about 5 years
As I indicated in the comment section of the original question, this goes deeper than playing with the pattern, since grep will only buffer lines terminated by \n, whereas \r does not terminate a line from the buffering standpoint.