How to handle CRLF line endings in grep?

Based on the page linked below, try these solutions:

https://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-on-linu

curl -sI http://unix.stackexchange.com | head -4 | grep "200 OK$(printf '\r')"

grep -IUlr $'\r'
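
The second command relies on bash's `$'\r'` quoting. A minimal sketch of the same idea for plain POSIX shells follows, assuming a grep that supports `-r`, `-I` and `-l` (GNU and BSD grep do). The function name `list_cr_files` and the demo file names are made up for illustration.

```shell
# List files under a directory that contain at least one CR byte.
# Assumption: grep supports -r (recurse), -I (skip binary files), -l (names only).
list_cr_files() {
    cr=$(printf '\r')   # literal carriage return, works in any POSIX shell
    grep -rIl -- "$cr" "$1"
}

# Demo on throwaway files (hypothetical names).
demo_dir=$(mktemp -d)
printf 'dos line\r\n' > "$demo_dir/dos.txt"
printf 'unix line\n'  > "$demo_dir/unix.txt"
list_cr_files "$demo_dir"   # prints the path of dos.txt only
rm -rf "$demo_dir"
```

The `-U` flag from the original command is left out here because, according to the man page excerpt quoted in the question, it has no effect outside MS-DOS and MS-Windows.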

Author: Witiko

Updated on September 18, 2022

Comments

  • Witiko
    Witiko over 1 year

    Suppose I have an arbitrary text input that contains CRLF line endings:

    $ curl -sI http://unix.stackexchange.com | head -4
    HTTP/1.1 200 OK
    Cache-Control: public, max-age=60
    Content-Length: 80551
    Content-Type: text/html; charset=utf-8
    
    $ curl -sI http://unix.stackexchange.com | head -4 | hexdump -C
    00000000  48 54 54 50 2f 31 2e 31  20 32 30 30 20 4f 4b 0d  |HTTP/1.1 200 OK.|
    00000010  0a 43 61 63 68 65 2d 43  6f 6e 74 72 6f 6c 3a 20  |.Cache-Control: |
    00000020  70 75 62 6c 69 63 2c 20  6d 61 78 2d 61 67 65 3d  |public, max-age=|
    00000030  36 30 0d 0a 43 6f 6e 74  65 6e 74 2d 4c 65 6e 67  |60..Content-Leng|
    00000040  74 68 3a 20 38 30 39 30  32 0d 0a 43 6f 6e 74 65  |th: 80902..Conte|
    00000050  6e 74 2d 54 79 70 65 3a  20 74 65 78 74 2f 68 74  |nt-Type: text/ht|
    00000060  6d 6c 3b 20 63 68 61 72  73 65 74 3d 75 74 66 2d  |ml; charset=utf-|
    00000070  38 0d 0a                                          |8..|
    00000073
    

    GNU grep 2.26 does not handle such input very well with respect to line endings:

    $ curl -sI http://unix.stackexchange.com | head -4 | grep '200 OK$'
    $ curl -sI http://unix.stackexchange.com | head -4 | grep '200 OK.$'
    HTTP/1.1 200 OK
    

    This is a little annoying. I can of course resolve this by including dos2unix into the pipeline:

    $ curl -sI http://unix.stackexchange.com | head -4 | dos2unix | grep '200 OK$'
    HTTP/1.1 200 OK
    

    but this feels a little hamfisted (and not very portable).
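
If portability is the main worry, `tr` can stand in for dos2unix. The sketch below is only an illustration: `strip_cr` is a hypothetical helper name, and the `printf` input simulates the curl output so the example runs without network access. Unlike dos2unix, it deletes every CR, not only those directly before a line feed.

```shell
# Remove all carriage returns from standard input.
# Note: this also drops CRs in the middle of a line, not just before LF.
strip_cr() {
    tr -d '\r'
}

# Simulated CRLF header lines in place of the curl call.
printf 'HTTP/1.1 200 OK\r\nContent-Length: 80551\r\n' | strip_cr | grep '200 OK$'
# prints: HTTP/1.1 200 OK
```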

    The odd thing is that the grep(1) man page claims that the tool will strip any CRs from the input, unless the input has been detected as binary:

    -U, --binary
           Treat the file(s) as binary.  By default, under MS-DOS and MS-Windows,
           grep guesses whether a file is text or binary  as  described  for  the
           --binary-files  option.   If  grep decides the file is a text file, it
           strips the CR characters from the  original  file  contents  (to  make
           regular  expressions  with  ^  and  $  work correctly).  Specifying -U
           overrules this guesswork, causing all files to be read and  passed  to
           the matching mechanism verbatim; if the file is a text file with CR/LF
           pairs  at  the  end  of  each  line,  this  will  cause  some  regular
           expressions  to  fail.   This  option has no effect on platforms other
           than MS-DOS and MS-Windows.
    

    EDIT: As stated in the manpage, this behaviour is MS-DOS and MS-Windows specific.

    Is it possible to make grep transparently handle CRLF (and CR) line endings without preprocessing the input? If not, is this something that should be patched, or is there a well-founded rationale?

    • JdeBP
      JdeBP over 7 years
      Modifying a pipeline by adding a filter partway through should not feel hamfisted. It's a Unix norm. dos2unix may not be portable, but tr and sed are and can do the same filtering. It's a perl one-liner, too.
    • Witiko
      Witiko over 7 years
      Sure, but it still adds complexity to the command and this strikes me as a common enough problem to warrant direct support within grep.
    • Angel Todorov
      Angel Todorov over 7 years
      A sed equivalent of dos2unix is sed 's/\r$//'
    • Witiko
      Witiko over 7 years
      … which is equivalent to using the pattern something\r\?$ instead of something$ directly within grep. Still, it is an annoyance and a level of detail I would expect grep to abstract away (if I ask nicely enough through some flag). Suppose you are grepping through a file that uses only \r to end lines (the way old Macs did). Then it becomes more than an annoyance, since grep will not recognize these as line endings and will buffer the entire file as a single line. Of course, sed 's/\r/\n/g' will fix this, but how anyone could think that having to do this is a good idea baffles me.
    • Admin
      Admin almost 2 years
      Plus, these filter solutions aren't suitable when I want to grep multiple files in one go, like grep "pattern$" *.txt
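
For the multi-file case raised in the last comment, the optional-CR pattern mentioned earlier in the thread can be wrapped in a small helper. This is a sketch, not a built-in grep feature: `grep_eol` is a made-up name, and the pattern is interpreted as an extended regular expression because the helper passes `-E`.

```shell
# grep_eol PATTERN [GREP-ARGS...]
# Match PATTERN at end of line, whether the line ends in LF or CRLF.
grep_eol() {
    pattern=$1
    shift
    cr=$(printf '\r')   # literal carriage return
    # "${cr}?" makes the CR optional; "\$" passes a literal $ anchor to grep.
    grep -E -e "${pattern}${cr}?\$" "$@"
}

# Demo on throwaway files (hypothetical names).
demo_dir=$(mktemp -d)
printf 'status OK\r\n'     > "$demo_dir/dos.txt"
printf 'status OK\n'       > "$demo_dir/unix.txt"
printf 'status OK, done\n' > "$demo_dir/other.txt"
grep_eol 'OK' -l "$demo_dir"/*.txt   # lists dos.txt and unix.txt, not other.txt
rm -rf "$demo_dir"
```

This does not help with CR-only files, since grep still splits its input on LF alone.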
  • Witiko
    Witiko over 7 years
    I can of course use $(printf '\r') – or $'\r' in bash – to insert a literal CR into the pattern. What I'm asking, however, is if there is a way for me to not have to do that. I'd like to match line endings transparently (i.e. regardless of whether they consist of a CR, LF, or CRLF).
  • Witiko
    Witiko about 5 years
    As I indicated in the comment section of the original question, this goes deeper than playing with the pattern, since grep will only buffer lines terminated by \n, whereas \r does not terminate a line from the buffering standpoint.
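
Since grep itself only splits on \n, input with CR-only or mixed line endings still needs a filter in front of it. The sketch below is one possible way to normalize CR, LF and CRLF to LF with POSIX tools. `normalize_eol` is a hypothetical name and the `Server: demo` header is invented test data. It remains a preprocessing step, so it does not provide the transparent handling asked for above.

```shell
# Convert CRLF and bare CR line endings to LF on standard input.
normalize_eol() {
    cr=$(printf '\r')   # literal carriage return
    # Step 1: drop a CR that sits directly before LF (CRLF -> LF).
    # Step 2: turn every remaining CR into LF (old Mac style -> LF).
    sed "s/${cr}\$//" | tr '\r' '\n'
}

# CR-only input, which grep would otherwise see as one long line.
printf 'HTTP/1.1 200 OK\rServer: demo\r' | normalize_eol | grep '200 OK$'
# prints: HTTP/1.1 200 OK
```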