How do I grep for all words that are less than 4 characters?

linux unix grep

14,679

Solution 1

You can just do:

grep -E -x '.{1,3}' myfile

This will also skip blank lines, which are technically not words. Unfortunately, the above regex counts apostrophes in contractions and hyphens in hyphenated compound words as characters. Hyphenated compound words are not a problem at such a low character count, but contractions this short are possible (e.g., I'm), and I am not sure whether you want their apostrophes counted. You can try a regex such as:

grep -E -x '\w{1,3}' myfile

..., but this will only match letters, digits, and underscores, so it will not match contractions or hyphenated compound words at all.
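
To see how the two patterns differ, here is a small sketch run against a made-up word list (the temporary file and its contents are assumptions for illustration only). It uses grep -E, which behaves like egrep, and forces the C locale so the result does not depend on the local character set.

```shell
# Hypothetical sample dictionary: one word per line, plus one blank line.
words=$(mktemp)
printf '%s\n' "a" "I'm" "cat" "" "x-y" "a_1" "word" "zebra" > "$words"

# Whole line is 1 to 3 characters of any kind: punctuation counts.
any_chars=$(LC_ALL=C grep -E -x '.{1,3}' "$words")

# Whole line is 1 to 3 "word" characters: letters, digits, underscore.
word_chars=$(LC_ALL=C grep -E -x '\w{1,3}' "$words")

echo "Up to 3 of any character:"
printf '%s\n' "$any_chars"
echo "Up to 3 word characters:"
printf '%s\n' "$word_chars"

rm -f "$words"
```

With this input, the first pattern keeps I'm and x-y because the apostrophe and hyphen count toward the length, while the second drops both of them and keeps a_1.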

Solution 2

Like this:

grep -v "^...." my_file
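
One thing to be aware of with an inverted match is that empty lines are printed too, since an empty line does not begin with four characters. A minimal sketch, using made-up input rather than a real dictionary:

```shell
# Hypothetical input: five lines, one of them empty.
sample=$(printf '%s\n' "ab" "" "abc" "abcd" "abcde")

# Inverted match: drop every line that begins with 4 characters.
inverted=$(printf '%s\n' "$sample" | grep -v '^....')

# Count the lines that were kept; the empty line is among them.
kept=$(printf '%s\n' "$inverted" | wc -l | tr -d ' ')
echo "$kept lines kept"

# If empty lines are unwanted, keep only lines with at least 1 character.
nonblank=$(printf '%s\n' "$inverted" | grep .)
printf '%s\n' "$nonblank"
```

Piping through grep . is just one way to drop the empty lines afterwards.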

Solution 3

Try this regular expression:

grep -E '^.{1,3}$' your_dictionary
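
Writing the ^ and $ anchors explicitly should select the same lines as the -x flag, which asks for a whole-line match. A short sketch that compares the two forms on a made-up list (the sample words are assumptions):

```shell
# Hypothetical word list, one word per line.
list=$(printf '%s\n' "to" "be" "or" "not" "that" "is" "question")

# Explicit anchors versus the -x (whole line) flag.
anchored=$(printf '%s\n' "$list" | grep -E '^.{1,3}$')
whole_line=$(printf '%s\n' "$list" | grep -E -x '.{1,3}')

if [ "$anchored" = "$whole_line" ]; then
    echo "both forms select the same lines"
fi
```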

Author by

TIMEX

Updated on July 11, 2022

Comments

TIMEX almost 2 years

I have a dictionary with words separated by line breaks.

tchrist about 13 years

Actually, it's worse than that: \w is messed up in GNU grep because a pattern like ^\w fails on strings like "β-oxidation" and "γ-aminobutyric". I would run perl -CSD -ne 'print if /^\W*(\w\W*){1,3}$/', because that way it handles contractions and hyphenated words but doesn’t count the non-word characters towards its limit of 3. If you care about actual letters, you can use \pL and \PL instead of \w and \W; \w matches more broadly than letters, or even than \p{Alphabetic}, per UTS#18’s requirements.

Paul Tomblin over 9 years

@cbmanica, no, you are incorrect. "grep -v" finds all lines that don't match, and I'm matching any line with 5 or more characters. In other words, it returns any line with 4 or fewer characters.

cbmanica over 9 years

Given that OP wanted to find words that are "less than 4 characters", I'm afraid you've confirmed my assertion that your answer is incorrect.

THESorcerer about 9 years

Yes, it is "less than", not "less than or equal to". Anyway, it is a good idea and gets the point across.