Print lines where first field has only four characters using regex in awk?

Solution 1

By default, awk splits fields on whitespace, so $1 never contains a space. The correct regex for $1 is therefore:

awk '$1 ~ /^[a-zA-Z0-9]{4}$/ {print}' file
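To see what $1 actually holds, here is a minimal sketch that prints the first field and its length for the sample records from the question. The file name sample.txt is a hypothetical scratch file, not something from the original post.

```shell
# Recreate the sample records from the question in a scratch file
# (the name sample.txt is an assumption made for this example).
cat > sample.txt <<'EOF'
John Goldenrod:(916) 348-4278:250:100:175
Chet Main:(510) 548-5258:50:95:135
Tom Savage:(408) 926-3456:250:168:200
Elizabeth Stachelin:(916) 440-1763:175:75:300
EOF

# Print the first whitespace-delimited field and its length for each record.
first_fields=$(awk '{print $1, length($1)}' sample.txt)
echo "$first_fields"
```

Since the space after the first name is the field separator, it is never part of $1, which is why a pattern that expects a space inside $1 cannot match.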

If you want to keep your original approach, you can match against $0 instead:

awk '$0 ~ /^[a-zA-Z0-9]{4}\s/ {print}' file

To simplify things, you can also use \w instead of spelling out the word characters (note that \w also matches the underscore, and that \w and \s are GNU awk extensions):

awk '$0 ~ /^\w{4}\s/ {print}' file

If you only want to match a space and not other whitespace such as TAB, replace \s with a literal space character.

Another issue with your original approach is the missing anchors. Since you specified neither ^ nor $, your pattern can match anywhere in the string; for example, in Elizabeth Stachelin it would match on beth.
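The effect of the missing anchor can be checked with a small sketch. It spells out the four character classes instead of using {4} so that it also runs on awk implementations without interval expressions; the variable names are only illustrative.

```shell
line='Elizabeth Stachelin:(916) 440-1763:175:75:300'

# Unanchored: any four alphanumerics followed by a space match, here "beth ".
unanchored=$(printf '%s\n' "$line" |
  awk '/[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9] / {print "match"}')

# Anchored with ^: the four characters must start the line, so "Eliza..." fails.
anchored=$(printf '%s\n' "$line" |
  awk '/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9] / {print "match"}')

echo "unanchored: ${unanchored:-no match}"
echo "anchored: ${anchored:-no match}"
```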

Solution 2

In AWK, you can use a regular expression as a pattern, just like the BEGIN or END patterns you often see in AWK scripts. A simplified version could be:

awk '/^[[:alnum:]]{4}\>/'

This is all you need to meet your needs. You do not need an action: {print} is the default action when a pattern matches, and it prints the entire record, i.e. the entire line.
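A quick way to convince yourself of the default action is to run the same pattern with and without an explicit {print}. This is only a sketch on two made-up input lines; the pattern /^John/ is chosen purely for illustration.

```shell
# Same pattern, once with an explicit action and once relying on the default.
with_action=$(printf '%s\n' 'John Goldenrod' 'Tom Savage' | awk '/^John/ {print}')
without_action=$(printf '%s\n' 'John Goldenrod' 'Tom Savage' | awk '/^John/')

echo "with action:    $with_action"
echo "without action: $without_action"
```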

[:alnum:] is basically a synonym for [a-zA-Z0-9], depending on locale. You can also use \w, which additionally matches the underscore _; it is shorthand for [[:alnum:]_]:

awk '/^\w{4}\>/'

\> matches the end of a word. With it, you also correctly match records such as John:(###)... that do not contain a full name.
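Because \> and \w are GNU awk extensions, here is a portable sketch of the same idea. It is a different technique from the one above, not a drop-in equivalent: it splits fields on any non-alphanumeric character so that $1 is the leading run of letters and digits, then tests its length. The first record (a name without a surname) is made up for illustration.

```shell
# Three records; the first has no surname, so the name is followed by ":".
records=$(printf '%s\n' \
  'John:(916) 348-4278:250:100:175' \
  'Chet Main:(510) 548-5258:50:95:135' \
  'Elizabeth Stachelin:(916) 440-1763:175:75:300')

# -F takes a regex: every non-alphanumeric character separates fields,
# so $1 ends at the first space, colon, or other punctuation.
matched=$(printf '%s\n' "$records" | awk -F'[^a-zA-Z0-9]' 'length($1) == 4')
echo "$matched"
```

The whole record is still printed unchanged, since only the field splitting differs, not $0.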

Although you asked about AWK, I would suggest using sed; it runs almost twice as fast as AWK in this case:

sed -n '/^[[:alnum:]]\{4\}\b/p'

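In GNU sed, \b matches either edge of a word, covering what GNU awk writes as \< (start of word) and \> (end of word). I tested on a 500K-line file in which 100K lines matched: AWK took around 1.7 seconds, sed only 0.9 seconds. That test case is extreme, though, so treat this as a nitpick.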

I would also suggest you read man 7 regex as well as man awk and info awk.

Solution 3

The first field is $1, and its length is length($1), so:

awk 'length($1) == 4 {print}'

or, more succinctly:

awk 'length($1) == 4'

What you wrote doesn't work for two reasons. First, you have an extra " " in your regexp, so you're requiring that the field contain a double quote, a space, and a double quote. If you fix that, you get /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/, which matches a field that contains at least four consecutive ASCII letters or digits but may contain more, so it will match Elizabeth as well as John, but not Tom. You can write /^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ to anchor the regexp at the start and end, but if what you're after is the length of the field, just write that.
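The difference between the unanchored regexp and the length test can be seen on the four first names from the question. This is a minimal sketch; the variable names are only illustrative.

```shell
names=$(printf '%s\n' John Chet Tom Elizabeth)

# Unanchored: any field containing four consecutive alphanumerics matches.
unanchored=$(printf '%s\n' "$names" |
  awk '$1 ~ /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/')

# Length test: only fields of exactly four characters match.
by_length=$(printf '%s\n' "$names" | awk 'length($1) == 4')

echo "unanchored regexp:"; echo "$unanchored"
echo "length(\$1) == 4:";  echo "$by_length"
```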

Author by

Ezequiel

Updated on September 18, 2022

Comments

  • Ezequiel
    Ezequiel over 1 year
    John Goldenrod:(916) 348-4278:250:100:175
    Chet Main:(510) 548-5258:50:95:135
    Tom Savage:(408) 926-3456:250:168:200
    Elizabeth Stachelin:(916) 440-1763:175:75:300

    The output should contain only the lines whose first name has exactly four characters (John, Chet):

    awk '$1 ~ /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]" "/ {print}' file
    

    This doesn't seem to work for me. Can I do it without using any of the awk functions?