Can I determine the type of an awk variable?


Solution 1

Awk has 4 types: "number", "string", "numeric string" and "undefined". Here is a function to detect them:

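function o_class(obj,   q, x, z) {
  q = CONVFMT                       # save the current CONVFMT
  CONVFMT = "% g"                   # temporary format that uses the space flag
  split(" " obj "\1" obj, x, "\1")  # x[1] = " " obj, x[2] = obj, both as split() elements
  x[1] = obj == x[1]
  x[2] = obj == x[2]
  x[3] = obj == 0
  x[4] = obj "" == +obj             # string value of obj against its converted numeric value
  CONVFMT = q                       # restore CONVFMT
  # map the four comparison results to a type name
  z["0001"] = z["1101"] = z["1111"] = "number"
  z["0100"] = z["0101"] = z["0111"] = "string"
  z["1100"] = z["1110"] = "strnum"
  z["0110"] = "undefined"
  return z[x[1] x[2] x[3] x[4]]
}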

The third argument of split must be a separator that is neither a space nor a character occurring in obj; otherwise obj itself would be split at it. I chose \1 based on Stéphane's suggestion. The function sets and restores CONVFMT internally, so it should return the correct result regardless of the value of CONVFMT at the time of the call:

split("12345.6", q); print 1, o_class(q[1])
CONVFMT = "%.5g"; split("12345.6", q); print 2, o_class(q[1])
split("nan", q); print 3, o_class(q[1])
CONVFMT = "%.6G"; split("nan", q); print 4, o_class(q[1])

Result:

1 strnum
2 strnum
3 strnum
4 strnum

Full test suite:

print 1, o_class(0)
print 2, o_class(1)
print 3, o_class(123456.7)
print 4, o_class(1234567.8)
print 5, o_class(+"inf")
print 6, o_class(+"nan")
print 7, o_class("")
print 8, o_class("0")
print 9, o_class("1")
print 10, o_class("inf")
print 11, o_class("nan")
split("00", q); print 12, o_class(q[1])
split("01", q); print 13, o_class(q[1])
split("nan", q); print 14, o_class(q[1])
split("12345.6", q); print 15, o_class(q[1])
print 16, o_class()

Result:

1 number
2 number
3 number
4 number
5 number
6 number
7 string
8 string
9 string
10 string
11 string
12 strnum
13 strnum
14 strnum
15 strnum
16 undefined

The notable weakness: if you pass a "numeric string" whose value is any of the following, the function incorrectly returns "number":

  • integer
  • inf
  • -inf

For integers, the POSIX spec explains why:

A numeric value that is exactly equal to the value of an integer shall be converted to a string by the equivalent of a call to the sprintf function with the string %d as the fmt argument

However, inf and -inf behave the same way; that is, the string conversion of none of the above is affected by the CONVFMT variable:

CONVFMT = "% g"
print "" .1
print "" (+"nan")
print "" 1
print "" (+"inf")
print "" (+"-inf")

Result:

 0.1
 nan
1
inf
-inf

In practice this doesn't really matter; see the duck test.

Solution 2

With gawk, PROCINFO["identifiers"] is an array with information about the identifiers used in the program. Use it like: PROCINFO["identifiers"]["your_variable_name"]. The value is one of "array", "builtin", "extension", "scalar", "untyped", "user".

There is only a general "scalar" category, which covers both strings and numbers, so this cannot tell the two apart. gawk decides how to treat a scalar from the context in which it is used.

This is why you will sometimes see a seemingly redundant variable + 0: it ensures that awk treats the variable as a number.
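As a minimal sketch of the lookup, assuming gawk is installed under the name gawk: the names identifier_kinds, arr and helper are made up for this illustration. The values appear to reflect what gawk knows once it has parsed the program, so an array and a user-defined function are the easiest cases to observe.

```shell
identifier_kinds() {
    # PROCINFO["identifiers"] is a gawk extension, so this calls gawk explicitly.
    gawk '
        function helper() { return 1 }   # hypothetical user-defined function
        BEGIN {
            arr[1] = helper()            # subscripting marks arr as an array
            print "arr:", PROCINFO["identifiers"]["arr"]
            print "helper:", PROCINFO["identifiers"]["helper"]
        }'
}

identifier_kinds
```

Note that nothing here separates a number from a string; both are scalars as far as this array is concerned.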
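A short sketch of the + 0 idiom, using two string constants (the function name force_numeric_demo and the variables a and b are only for illustration). Without + 0 the comparison is lexical; with it, both operands are numbers.

```shell
force_numeric_demo() {
    awk 'BEGIN {
        a = "10"; b = "9"    # string constants, so both are plain strings
        # Lexical comparison: "10" sorts before "9" because "1" sorts before "9".
        print ((a < b) ? "string compare: 10 < 9" : "string compare: 10 >= 9")
        # Adding 0 makes both operands numeric, so the comparison is numeric.
        print ((a + 0 < b + 0) ? "numeric compare: 10 < 9" : "numeric compare: 10 >= 9")
    }'
}

force_numeric_demo
# prints:
#   string compare: 10 < 9
#   numeric compare: 10 >= 9
```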

See this paragraph for some of the trickery with implicit conversions.

Solution 3

To clarify, only strings that come from a few sources (here quoting the POSIX spec):

  1. Field variables
  2. Input from the getline() function
  3. FILENAME
  4. ARGV array elements
  5. ENVIRON array elements
  6. Array elements created by the split() function
  7. A command line variable assignment
  8. Variable assignment from another numeric string variable

are to be considered a numeric string if their value happens to be numerical (allowing leading and trailing blanks, with variations between implementations in support for hex, octal, inf, nan...).
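Two of these sources can be probed with a rough sketch like the one below. The names source_demo, from_cmdline and parts are hypothetical. The probe compares the value with 12: a numeric comparison finds 3.14 smaller, whereas a lexical one would not, since "3" sorts after "1".

```shell
source_demo() {
    awk -v from_cmdline=3.14 'BEGIN {
        split("3.14", parts)   # parts[1] is an element created by split()
        print "command line:", ((from_cmdline < 12) ? "numeric" : "string")
        print "split():", ((parts[1] < 12) ? "numeric" : "string")
    }'
}

source_demo
# prints:
#   command line: numeric
#   split(): numeric
```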

The "3.14" literal string constant is a string, not strnum, because it doesn't come from one of those sources.

x = "3.14"; if (x == 3.14) print "yes"

prints yes, but only because awk does a lexical comparison (depending on the implementation, using memcmp(), strcmp() or strcoll()) between the string "3.14" and the string obtained by converting the number 3.14 via the CONVFMT format (%.6g by default in gawk and many other implementations). That is, with that value of CONVFMT, (x == 3.14) is the same as (x == "3.14").

(x < 12) would be false, because 3.14 sorts lexically after 12 (same as ("3.14" < "12")). With CONVFMT = "%.6e", (x == 3.14) would also return false because that becomes ("3.14" == "3.140000e+00").
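The three comparisons described above can be run together as a small sketch (literal_compare_demo is an illustrative name, not anything from awk itself):

```shell
literal_compare_demo() {
    awk 'BEGIN {
        x = "3.14"                                    # string constant: a string, not a strnum
        print ((x == 3.14) ? "equal" : "not equal")   # "3.14" against "3.14"
        print ((x < 12) ? "less" : "not less")        # "3.14" against "12", lexically
        CONVFMT = "%.6e"
        print ((x == 3.14) ? "equal" : "not equal")   # "3.14" against "3.140000e+00"
    }'
}

literal_compare_demo
# prints:
#   equal
#   not less
#   not equal
```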

On the other hand, in:

"echo \"3.1400 \"" | getline x
if (x == 3.14) print "yes"
if (x < 12) print "yes"

yes is printed twice whatever the value of CONVFMT, because a numerical comparison is performed. x is a strnum because it comes from getline and has a numeric value.

It still retains its string value though. print x will print "3.1400 " whatever the value of OFMT or CONVFMT.
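The same behaviour can be sketched with a field variable instead of getline (both are in the list of sources above). The name strnum_demo is hypothetical, and CONVFMT is deliberately set to a value that would break a string comparison with 3.14.

```shell
strnum_demo() {
    printf '3.1400 \n' | awk '
        BEGIN { CONVFMT = "%.6e" }   # irrelevant to numeric comparisons
        {
            x = $0                                        # copied from a field: a strnum
            print ((x == 3.14) ? "equal" : "not equal")   # numeric comparison
            print ((x < 12) ? "less" : "not less")        # numeric comparison
            print "[" x "]"                               # the original text is kept
        }'
}

strnum_demo
# prints:
#   equal
#   less
#   [3.1400 ]
```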

And:

"echo 3.14 foo" | getline x
if (x == 3.14) print "yes"

Doesn't print yes. x comes from getline but doesn't have a numeric value (because of the foo). It is a normal string, as if you had written x = "3.14 foo". Still, you will be able to do numeric operations with it:

print x + 1

Will output 4.14. Because x is involved in a numeric operation here, the string is converted to a number using its initial part (after any leading blanks) that looks like a number.

So (x+0 == 3.14) and (x+0 < 12) will also return true. x+0 is numeric, so we've got a numeric comparison.
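Put together as a runnable sketch, reading the value from input as a field rather than through getline (partial_numeric_demo is an illustrative name):

```shell
partial_numeric_demo() {
    echo '3.14 foo' | awk '{
        x = $0                                            # does not look numeric: a plain string
        print ((x == 3.14) ? "equal" : "not equal")       # string comparison
        print x + 1                                       # only the leading 3.14 is used
        print ((x + 0 == 3.14) ? "equal" : "not equal")   # numeric comparison
    }'
}

partial_numeric_demo
# prints:
#   not equal
#   4.14
#   equal
```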

Note that inf, nan and Infinity are not recognised as constants for the floating-point inf or nan special values, but in several awk implementations you can use ("inf"+0) instead.

Solution 4

Since GNU Awk 4.2, there is a typeof() function to check this, as indicated in the release notes of the beta release:

  1. The new typeof() function can be used to indicate if a variable or array element is an array, regexp, string or number. The isarray() function is deprecated in favor of typeof().

So now you can say:

$ awk 'BEGIN {print typeof("a")}'
string
$ awk 'BEGIN {print typeof(1)}'
number
$ awk 'BEGIN {print typeof(a[1])}'
unassigned
$ awk 'BEGIN {a[1]=1; print typeof(a)}'
array
$ echo ' 1 ' | awk '{print typeof($0)}'
strnum
Utku
Author by

Utku

Updated on September 18, 2022

Comments

  • Utku
    Utku almost 2 years

    I have the gawk version of awk. In this part of gawk manual, it is stated that awk variables have "attributes", which are used to determine how to treat them in various operations.

    For example, a string that is of the form " +3.14" which is obtained by parsing the input has the STRNUM attribute, which makes it behave as a number in a comparison with a number, whereas the same string defined in an awk program does not have this attribute.

    OTOH, a string like "3.14" apparently has the STRNUM attribute even if it was defined in the program, because the code x = "3.14" { print x == 3.14 } prints 1. Whereas if we define it as "+3.14" or " 3.14", it does not have the STRNUM attribute, since x = "+3.14" { print x == 3.14 } or x = " 3.14" { print x == 3.14 } prints 0.

    I think that such succinctness in variable typing may cause subtle bugs. Hence, in order to aid in debugging such situations, is there a way to learn what type of "attributes" a variable has? I.e, can we learn what is the type of a variable?

    • Utku
      Utku about 8 years
      @123 I know that unless I use arithmetic operators on it, then it will be treated as a string. But if I use arithmetic operators on it, it will be treated as a number, whereas this is not the case for such string manually defined in an awk program.
    • 123
      123 about 8 years
      If you use arithmetic operators on any variable then it will be treated as a number no matter how it was defined.
  • Stéphane Chazelas
    Stéphane Chazelas about 7 years
    Sorry for the confusion, I didn't say it was only a problem with the original awk implementation but with those implementations based on (derived from) it. That includes the awk of current versions of Solaris, FreeBSD, macOS and I suppose most other commercial Unices. Yes, that's a bug, but a widespread one with an easy work around. Note that gawk is almost as ancient (1986). The real awk still maintained by Brian Kernighan (the k in awk) does incorporate features from gawk.
  • Justin
    Justin about 7 years
    Any way to have this check if the value is an array as well?
  • fedorqui
    fedorqui over 6 years
    @Stéphane it is funny how echo ' 1 ' | awk '{print typeof($0)}' returns "strnum", while awk 'BEGIN{print typeof(" 1 ")}' returns "string". Any hint on why it is like this?
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    that's the whole point of this Q&A. See my answer or Steven's