Why does awk -F work for most letters, but not for the letter "t"?


Because:

Normally, any number of blanks separate fields. In order to set the field separator to a single blank, use the -F option with a value of [ ]. If a field separator of t is specified, awk treats it as if \t had been specified and uses <TAB> as the field separator. In order to use a literal t as the field separator, use the -F option with a value of [t].

That's from the FreeBSD awk man page; the utilities that ship with macOS are usually older versions of the FreeBSD ones.

$ printf 'foo\tbar\n' | awk -F t '{print NF-1}'
1
$ echo total | awk -F '[t]' '{print NF-1}'
2

In a way, that seems like a useful shorthand for files with tab-separated values, but since other letters are taken as-is, it's confusing. It only works like that with -F; using -v FS=t doesn't do it.
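Since -v FS=t is not subject to the special case, it can serve as a workaround for counting a letter. Here is a minimal sketch, assuming the separator is a single plain letter (a space or a regex metacharacter would behave differently); count_letter is a made-up helper name, not something any awk provides:

```shell
# count_letter is a hypothetical helper for illustration only.
# Usage: count_letter LETTER STRING
# Assumes LETTER is one plain letter, not a space or a regex metacharacter.
count_letter() {
    # Setting FS through -v avoids the "-F t means TAB" special case.
    printf '%s\n' "$2" | awk -v FS="$1" '{ print NF - 1 }'
}

count_letter t tweeter   # 2, same as the gsub() result in the question
count_letter e tweeter   # 3
```

Using -F '[t]' as shown earlier should work just as well; the -v form merely keeps the letter as a plain argument.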

The feature is non-POSIX, as POSIX says that -F x is the same as -v FS=x. Most other awks I tested treated t as the literal letter (some versions of gawk, mawk and Busybox).

The version of awk that e.g. Debian has in the original-awk package ("One True AWK" or "BWK awk", presumably from Brian W. Kernighan's initials) does support it, though, and at least Wikipedia seems to indicate that would be the same software FreeBSD uses. This one appears to be based on the version described in the 1988 book "The AWK Programming Language", but I'm not an expert on awk lineages and don't know if it has evolved significantly since then. That one is on GitHub, but the documentation there doesn't seem to describe the feature. The special case can be seen in the code (where it's described as "a wart" in a comment).

You can get the same behaviour with GNU awk in BWK-awk compatibility mode, though:

As a special case, in compatibility mode (see section Command-Line Options), if the argument to -F is ‘t’, then FS is set to the TAB character. If you type ‘-F\t’ at the shell, without any quotes, the ‘\’ gets deleted, so awk figures that you really want your fields to be separated with TABs and not ‘t’s.
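To try the quoted behaviour, something like the following should do. It is only a sketch: it assumes gawk is installed under that name and that compatibility mode is selected with --traditional, placed before -F so that it is already in effect when the separator is parsed. The function name is made up for illustration.

```shell
# tab_count_traditional is a hypothetical name used for this demo only.
# It feeds one TAB-separated line to gawk in compatibility mode with -F t.
tab_count_traditional() {
    if command -v gawk >/dev/null 2>&1; then
        # Per the documentation quoted above, -F t should mean TAB here,
        # so the single TAB in the input is counted as one separator.
        printf 'foo\tbar\n' | gawk --traditional -F t '{ print NF - 1 }'
    else
        # Assumption: skip quietly where gawk is not available.
        echo "gawk not found, skipping" >&2
    fi
}

tab_count_traditional
```

Without --traditional, the same command should treat t as the literal letter, in line with the test results mentioned above.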

Author by

FeRD

Updated on January 05, 2023

Comments

  • FeRD
    FeRD over 1 year
     July 2022      mac os Monterey V12.1 
       awk --version 20200816
       GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21)
    

    Why does awk -F work for most letters, but NOT for the letter t? I have the solution, but I would like to understand why awk fails for the letter t.

    # Count 'e's
    % echo "tweeter" | awk -F "e" '{print NF-1}'
    3
    
    # Count 'r's
    % echo "tweeter" | awk -F "r" '{print NF-1}'
    1
    
    # (Attempt to) count 't's
    % echo "tweeter" | awk -F "t" '{print NF-1}'
    0   <=== ????
    
    # Use gsub()
    % echo "tweeter" | awk '{print gsub(/t/, "")}'
    2
    
  • Admin
    Admin almost 2 years
    I wonder if Mr Kernighan was accommodating users who didn't quote the value: awk -F \t ... and if sh and csh handled that differently.
  • Admin
    Admin almost 2 years
    @glennjackman, the GNU documentation that Ed linked to there actually mentions just that. Both sh and csh drop the backslash there.
  • Admin
    Admin almost 2 years
    @QuartzCristal and Ed, it might not even matter, at least Wikipedia says the awk that FreeBSD uses is BWK awk / nawk. The version messages look the same between the macOS one and the Debian "original-awk" one. (And I'm not going to dig deeper. The one I have that came with macOS is an older version though)
  • Admin
    Admin almost 2 years
    FWIW, in rc -F\t remains as is as backslash is not a quoting operator there. In es or fish, -F\t becomes -F<TAB>.
  • Admin
    Admin almost 2 years
    The interpretation of "t" (it happens to be quoted, as it should, in the OP) as TAB seems like a misfeature of BSD's awk.
  • Admin
    Admin almost 2 years
    @QuartzCristal, it's just like that in the awk that Debian has in the original-awk package. Or the one available at github.com/onetrueawk/awk. (this function) It doesn't matter if you quote the t there, -F t, -F \t and -F "t" all give the same result in the shell: the two arguments -F and t.
  • Admin
    Admin almost 2 years
    Yes @ilkkachu . Both suffer of the same illness IMO. But if you have the (correct) habit of quoting the FS parameter as in -F "\t" (as it should be done) the shell will not change the value. Then it becomes shocking (as an odd surprise) that -F "t" gets interpreted as a TAB. And nawk (which some people supposedly claim that is the same binary as bwk) doesn't do that (at least in Debian (nor Fedora, I believe)).
  • Admin
    Admin almost 2 years
    It's not described in POSIX; POSIX says that -F sepstring is equivalent to -v FS=sepstring, and no special handling for t is described of FS.
  • Admin
    Admin almost 2 years
    Where does \534 come from? Or the \564? Or \411...? Are those supposed to be backslash escapes for some character, or what?
  • Admin
    Admin almost 2 years
    @ilkkachu : octal codes for a single byte wrap around every 400 - those are ways of writing \134 and \011 but attempting to prevent a typical environment that may try to be too clever with \011 and replace them with their own interpretation of the tab, since the question here is to prevent any arbitrary interpretation of ur code before awk reads it. don't tell me u didn't know that