How do I grep for all words that are less than 4 characters?
Solution 1
You can just do:
grep -Ex '.{1,3}' myfile
This will also skip blank lines, which are technically not words. Unfortunately, the above regex counts every character toward the limit, including apostrophes in contractions and hyphens in hyphenated compound words. Hyphenated compound words are not a problem at such a low letter count, but I am not sure whether you want to count the apostrophes in contractions, which are possible at this length (e.g., I'm). You can try a regex such as:
grep -Ex '\w{1,3}' myfile
..., but this matches only word characters (letters, digits, and underscores), so it will not match contractions or hyphenated compound words at all.
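To make the difference between the two patterns concrete, here is a minimal sketch that runs both against a small, made-up word list. The sample words and the variable names are illustrative assumptions, and `\w` is assumed to be available as it is in GNU grep.

```shell
# Hypothetical sample word list, one word per line (the last entry is a blank line).
words=$(mktemp)
printf '%s\n' cat "I'm" a-b ab1 house '' > "$words"

# Any 1 to 3 characters: the apostrophe and the hyphen count toward the limit.
any_chars=$(grep -Ex '.{1,3}' "$words")

# 1 to 3 word characters only (assumes GNU grep's \w: letters, digits, underscore).
word_chars=$(grep -Ex '\w{1,3}' "$words")

printf 'any characters:\n%s\n' "$any_chars"
printf 'word characters:\n%s\n' "$word_chars"
rm -f "$words"
```

With this input, the first pattern keeps `I'm` and `a-b` because they are three characters long, while the second drops them but still keeps `ab1`, since digits are word characters.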
Solution 2
Like this:
grep -v "^...." my_file
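One thing worth checking with the inverted match is that blank lines also lack four leading characters, so they pass the filter. The sketch below, using a made-up word list and illustrative variable names, shows that behaviour and one possible way to drop the blank lines afterwards.

```shell
# Hypothetical sample word list with one blank line in the middle.
list=$(mktemp)
printf '%s\n' an the '' word be > "$list"

# Inverted match: drop every line that starts with four characters.
# Blank lines survive because they do not match '^....' either.
short_or_empty=$(grep -v '^....' "$list")

# Same filter, then keep only lines that contain at least one character.
short_only=$(grep -v '^....' "$list" | grep '.')

printf '%s\n' "$short_only"
rm -f "$list"
```

The second pipeline is just one option; whether blank lines matter depends on how the dictionary file is laid out.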
Solution 3
Try this regular expression:
grep -E '^.{1,3}$' your_dictionary
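The anchors are what make this work: without `^` and `$`, the pattern would match a short run of characters inside any longer word. As a rough sketch on a made-up dictionary (file contents and variable names are assumptions for illustration), the anchored form can be compared with the `-x` option and with an unanchored pattern:

```shell
# Hypothetical dictionary file, one word per line.
dict=$(mktemp)
printf '%s\n' a to cat four seven > "$dict"

# Explicit ^ and $ anchors: the whole line must be 1 to 3 characters.
anchored=$(grep -E '^.{1,3}$' "$dict")

# -x requires the pattern to match the whole line, so no anchors are needed.
whole_line=$(grep -Ex '.{1,3}' "$dict")

# Without anchors the pattern matches inside longer words too,
# so every non-empty line is counted.
unanchored_count=$(grep -Ec '.{1,3}' "$dict")

printf '%s\n' "$anchored"
printf 'unanchored matches: %s\n' "$unanchored_count"
rm -f "$dict"
```

Both anchored forms select the same lines here, so the choice between them is mostly a matter of taste.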
Author by
TIMEX
Updated on July 11, 2022
Comments
-
TIMEX almost 2 years
I have a dictionary with words separated by line breaks.
-
tchrist about 13 years
Actually, it's worse than that: \w is messed up in GNU grep because a pattern like ^\w fails on strings like "β-oxidation" and "γ-aminobutyric". I would run perl -CSD -ne 'print if /^\W*(\w\W*){1,3}$/', because that way it handles contractions and hyphenated words but doesn’t count the non-word characters towards its limit of 3. If you care about actual letters, you can use \pL and \PL instead of \w and \W, which match more broadly than that or even \p{Alphabetic}, per UTS#18’s requirements.
-
Paul Tomblin over 9 years
@cbmanica, no, you are incorrect. "grep -v" finds all lines that don't match, and I'm matching any line with 5 or more characters. In other words, it returns any line with 4 or fewer characters.
-
cbmanica over 9 years
Given that OP wanted to find words that are "less than 4 characters", I'm afraid you've confirmed my assertion that your answer is incorrect.
-
THESorcerer about 9 years
Yes, it is "less than", not "less than or equal to". Anyway, it is a good idea and gets the point across.