Print lines where first field has only four characters using regex in awk?
Solution 1
By default, awk splits fields on whitespace (runs of spaces and tabs), which means $1 never contains a space, so the correct regex for $1 is:
awk '$1 ~ /^[a-zA-Z0-9]{4}$/ {print}' file
If you want to keep your original approach, you can also just match against $0 instead, i.e.:
awk '$0 ~ /^[a-zA-Z0-9]{4}\s/ {print}' file
To simplify things, you can also use \w instead of explicitly listing the word characters (like \s, it is a GNU awk extension), i.e.:
awk '$0 ~ /^\w{4}\s/ {print}' file
If you only want to match a space and nothing else, such as a TAB, you just have to replace \s with " " (a single space, without the quotation marks).
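A small sketch of the difference, assuming a hypothetical two-line input in which one record has a space after the first name and the other a tab. The bracket expression [ \t] is used here as a portable stand-in for \s, and the character class is written out four times in case the awk at hand lacks {4} intervals:

```shell
# Hypothetical input: first record separated by a space, second by a tab.
input=$(printf 'John Goldenrod\nChet\tMain\n')

# Literal space only: the tab-separated record is not matched.
space_only=$(printf '%s\n' "$input" |
  awk '/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9] /')

# Space or tab: both records are matched.
space_or_tab=$(printf '%s\n' "$input" |
  awk '/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][ \t]/')

printf 'space only:   %s\n' "$space_only"
printf 'space or tab: %s\n' "$space_or_tab"
```

With this input, only the John record should survive the literal-space version, while both records pass the second one.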
Another issue with your original approach is the missing anchors. As you specified neither ^ nor $, your pattern can match anywhere, i.e. it would match Elizabeth Stachelin via beth.
Solution 2
In AWK, you can use a regular expression as a pattern, in the same place as the BEGIN or END patterns you often see in AWK scripts. A simplified version could be:
awk '/^[[:alnum:]]{4}\>/'
This is all you need. You do not need an action: {print} is the default action when a pattern matches, and it prints the entire record, i.e. the entire line.
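To see the default action at work, here is a minimal sketch with a hypothetical two-line input; the pattern /^John/ is only for illustration:

```shell
# Hypothetical two-line input.
input=$(printf 'John Goldenrod\nTom Savage\n')

# Pattern with an explicit action ...
with_action=$(printf '%s\n' "$input" | awk '/^John/ {print}')

# ... and the same pattern with the action left out.
without_action=$(printf '%s\n' "$input" | awk '/^John/')

printf '%s\n' "$with_action" "$without_action"
```

Both commands should print the same record, since a pattern without an action falls back to printing the whole line.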
[[:alnum:]] is basically a synonym for [a-zA-Z0-9], depending on locale. You can also use \w, except that it also includes the underscore _; it is shorthand for [[:alnum:]_]:
awk '/^\w{4}\>/'
\> matches the end of a word. By using it, you can also correctly match strings like John:(###)..., in case you have records that do not contain full names.
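\> is a GNU awk operator, so other awk implementations may not accept it. As a rough portable sketch (not taken from the answer itself), the same idea can be expressed by requiring that the four characters are followed either by the end of the line or by a non-word character. The input lines below are hypothetical, and the class is written out four times to avoid relying on {4}:

```shell
# Hypothetical records, some without a full name.
input=$(printf '%s\n' \
  'John:(916) 348-4278' \
  'Chet Main:(510) 548-5258' \
  'Elizabeth Stachelin:(916) 440-1763' \
  'Mary' \
  'Tom')

# Four alphanumerics, then end of line or a non-word character.
matched=$(printf '%s\n' "$input" |
  awk '/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ ||
       /^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9_]/')

printf '%s\n' "$matched"
```

For ASCII input this should select the John, Chet and Mary records and skip Elizabeth and Tom.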
Although you are asking about AWK, I would suggest using sed; it runs almost twice as fast as AWK in this case:
sed -n '/^[[:alnum:]]\{4\}\b/p'
\b matches a word boundary here; in AWK the counterparts are \> and \<. I tested on a file of 500K lines, 100K of which matched: AWK took around 1.7 seconds, sed only 0.9 seconds. But the test case is extreme, so this is just a nitpick suggestion.
I would also suggest you read man 7 regex as well as man awk and info awk.
Solution 3
The first field is $1, and its length is length($1), so:
awk 'length($1) == 4 {print}'
or, more succinctly:
awk 'length($1) == 4'
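As a quick check, here is a sketch that runs the shorter form against the four sample records from the question (the variable names are only for illustration):

```shell
# The four sample records from the question.
input=$(printf '%s\n' \
  'John Goldenrod:(916) 348-4278:250:100:175' \
  'Chet Main:(510) 548-5258:50:95:135' \
  'Tom Savage:(408) 926-3456:250:168:200' \
  'Elizabeth Stachelin:(916) 440-1763:175:75:300')

# Keep only records whose first whitespace-delimited field is 4 characters long.
four=$(printf '%s\n' "$input" | awk 'length($1) == 4')

printf '%s\n' "$four"
```

This should print the John and Chet records only.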
What you wrote doesn't work for two reasons. First, you have an extra " " in your regexp, so you're requiring that the field contains double quote, space, double quote. Second, if you fix that, you get /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/, which matches a field that contains at least four ASCII letters or digits but may contain more, so it will match Elizabeth as well as John, but not Tom. You can write /^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ to anchor the regexp at the start and end, but if what you're after is the length of the field, just write that.
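The three variants can be compared side by side. This is only a sketch: it prints $1 instead of the whole record to keep the output short, and the input is reduced to the names from the question:

```shell
# Names taken from the sample records in the question.
input=$(printf '%s\n' 'John Goldenrod' 'Chet Main' 'Tom Savage' 'Elizabeth Stachelin')

# 1. Original attempt: the literal " " can never occur inside $1.
quoted=$(printf '%s\n' "$input" |
  awk '$1 ~ /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]" "/ {print $1}')

# 2. Quotes removed, no anchors: any field with four or more characters matches.
unanchored=$(printf '%s\n' "$input" |
  awk '$1 ~ /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/ {print $1}')

# 3. Anchored at both ends: exactly four characters.
anchored=$(printf '%s\n' "$input" |
  awk '$1 ~ /^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ {print $1}')

printf 'quoted:     [%s]\n' "$quoted"
printf 'unanchored: %s\n' $unanchored
printf 'anchored:   %s\n' $anchored
```

The first variant should select nothing, the second John, Chet and Elizabeth, and the third only John and Chet.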
Ezequiel
Updated on September 18, 2022
John Goldenrod:(916) 348-4278:250:100:175
Chet Main:(510) 548-5258:50:95:135
Tom Savage:(408) 926-3456:250:168:200
Elizabeth Stachelin:(916) 440-1763:175:75:300
The output should contain the lines whose names have only four characters (John, Chet):
awk '$1 ~ /[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]" "/ {print}' file
This doesn't seem to work for me. Can I do it without using any of the awk functions?