Why does awk -F work for most letters, but not for the letter "t"?
Because:
Normally, any number of blanks separate fields. In order to set the field separator to a single blank, use the -F option with a value of [ ]. If a field separator of t is specified, awk treats it as if \t had been specified and uses <TAB> as the field separator. In order to use a literal t as the field separator, use the -F option with a value of [t].
That's from the FreeBSD awk man page, and the utilities that come with macOS are usually older FreeBSD versions or similar.
$ printf 'foo\tbar\n' | awk -F t '{print NF-1}'
1
$ echo total | awk -F '[t]' '{print NF-1}'
2
In a way, that seems like a useful shorthand for files with tab-separated values, but with other letters taken as-is, it's confusing. It only works like that with -F; using -v FS=t doesn't do it.
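To avoid depending on the special case at all, both separators can be spelled unambiguously. The following is a minimal sketch; the function names count_literal_t and second_tab_field are made up for illustration and are not part of any awk.

```shell
# Count occurrences of a literal "t" per line. The bracket expression [t]
# is a one-character regex, so it is never taken as the TAB shorthand.
count_literal_t() {
    awk -F '[t]' '{ print NF - 1 }'
}

# Print the second TAB-separated field. The separator is written as the
# escape sequence \t, quoted so that the shell passes the backslash through.
second_tab_field() {
    awk -F '\t' '{ print $2 }'
}

echo tweeter | count_literal_t
printf 'foo\tbar\n' | second_tab_field
```

This prints 2 and then bar.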
The feature is non-POSIX, as POSIX says that -F x is the same as -v FS=x. Most other awks I tested treated t as the literal letter (some versions of gawk, mawk and Busybox).
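The difference between the two spellings can be checked on any given system with a sketch like the following. The helper name count_t_with_v is hypothetical; the last line is included only for comparison, since its result depends on the awk in use.

```shell
# Hypothetical helper: set FS through a variable assignment, which is not
# subject to the -F special case, so "t" stays a literal letter.
count_t_with_v() {
    awk -v FS=t '{ print NF - 1 }'
}

echo tweeter | count_t_with_v

# For comparison: prints the same number where -F t is taken literally,
# and 0 where the awk treats -F t as TAB (the input contains no TAB).
echo tweeter | awk -F t '{ print NF - 1 }'
```

The first command prints 2. If the second prints 0, the awk on that system applies the TAB shorthand.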
The version of awk that e.g. Debian has in the original-awk package ("One True AWK" or "BWK awk" presumably from Brian W. Kernighan's initials) does support it, though, and at least Wikipedia seems to indicate that would be the same software FreeBSD uses. This one appears to be based on the version described in the 1988 book "The AWK Programming Language", but I'm not an expert on awk lineages and don't know if it has evolved significantly since then. That one is on GitHub, but the documentation there doesn't seem to describe the feature. The special case can be seen in the code (where it's described as "a wart" in a comment).
You can get the same behaviour with GNU awk in BWK-awk compatibility mode, though:
As a special case, in compatibility mode (see section Command-Line Options), if the argument to -F is ‘t’, then FS is set to the TAB character. If you type ‘-F\t’ at the shell, without any quotes, the ‘\’ gets deleted, so awk figures that you really want your fields to be separated with TABs and not ‘t’s.
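The quoting behaviour that passage refers to can be made visible with a small sketch. show_args is a hypothetical helper that prints each argument it receives in angle brackets, i.e. what awk would see. The last part runs only if gawk happens to be installed and assumes its --traditional option is the compatibility mode meant above.

```shell
# Hypothetical helper: print every argument wrapped in angle brackets.
show_args() {
    printf '<%s>' "$@"
    printf '\n'
}

show_args -F t      # <-F><t>
show_args -F \t     # <-F><t>    the shell removes the unquoted backslash
show_args -F "t"    # <-F><t>
show_args -F '\t'   # <-F><\t>   quoting keeps the backslash for awk

# Only if GNU awk is available: try -F t in compatibility mode.
if command -v gawk >/dev/null 2>&1; then
    printf 'foo\tbar\n' | gawk --traditional -F t '{ print $2 }'
fi
```

Going by the documentation quoted above, the gawk line should print bar, because FS ends up as the TAB character.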
FeRD
Updated on January 05, 2023
Comments
-
FeRD 5 months
July 2022, macOS Monterey 12.1, awk --version 20200816, GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21)
Why does awk -F work for most letters, but NOT for the letter t? I have the solution, but I would like to understand why awk fails for the letter t.
# Count 'e's
% echo "tweeter" | awk -F "e" '{print NF-1}'
3
# Count 'r's
% echo "tweeter" | awk -F "r" '{print NF-1}'
1
# (Attempt to) count 't's
% echo "tweeter" | awk -F "t" '{print NF-1}'
0 <=== ????
# Use gsub()
% echo "tweeter" | awk '{print gsub(/t/, "")}'
2
-
Admin 11 months
I wonder if Mr Kernighan was accommodating users who didn't quote the value: awk -F \t ... and if sh and csh handled that differently.
-
Admin 11 months
@glennjackman, the GNU documentation that Ed linked to there actually mentions just that. Both sh and csh drop the backslash there.
-
Admin 11 months
@QuartzCristal and Ed, it might not even matter, at least Wikipedia says the awk that FreeBSD uses is BWK awk / nawk. The version messages look the same between the macOS one and the Debian "original-awk" one. (And I'm not going to dig deeper. The one I have that came with macOS is an older version, though.)
-
Admin 11 months
FWIW, in rc, -F\t remains as is, as backslash is not a quoting operator there. In es or fish, -F\t becomes -F<TAB>.
-
Admin 11 months
The interpretation of "t" (it happens to be quoted, as it should, in the OP) as TAB seems like a misfeature of BSD's awk.
-
Admin 11 months
@QuartzCristal, it's just like that in the awk that Debian has in the original-awk package. Or the one available at github.com/onetrueawk/awk. (this function) It doesn't matter if you quote the t there: -F t, -F \t and -F "t" all give the same result in the shell, the two arguments -F and t.
-
Admin 11 months
Yes @ilkkachu. Both suffer from the same illness IMO. But if you have the (correct) habit of quoting the FS parameter, as in -F "\t" (as it should be done), the shell will not change the value. Then it comes as an odd surprise that -F "t" gets interpreted as a TAB. And nawk (which some people claim is the same binary as bwk) doesn't do that, at least in Debian (nor in Fedora, I believe).
-
Admin 11 months
It's not described in POSIX; POSIX says that -F sepstring is equivalent to -v FS=sepstring, and it describes no special handling of t for FS.
-
Admin 11 months
Where does \534 come from? Or the \564? Or \411...? Are those supposed to be backslash escapes for some character, or what?
-
Admin 11 months
@ilkkachu: octal codes for a single byte wrap around every 400. Those are ways of writing \134 and \011 that try to get past a typical environment which may be too clever with \011 and replace it with its own interpretation of the tab, since the question here is how to prevent any arbitrary interpretation of your code before awk reads it. Don't tell me you didn't know that.
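The arithmetic behind that wrap-around claim can be sketched in the shell: keeping only the low 8 bits of an octal value is the same as reducing it modulo 400 octal. The helper name low_byte_octal is hypothetical, and whether a particular awk really accepts escapes above \377 and wraps them this way is an assumption taken from the comment above, not something this snippet verifies.

```shell
# Hypothetical helper: reduce an octal constant to its low byte and
# print the result as three octal digits.
low_byte_octal() {
    printf '%03o\n' "$(( $1 & 0377 ))"
}

low_byte_octal 0534   # 134, the backslash
low_byte_octal 0564   # 164, the letter t
low_byte_octal 0411   # 011, TAB
```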