Find Duplicate/Repeated or Unique words spanning across multiple lines in a file
Solution
grep
grep -wo "[[:alnum:]]\+" input_file.txt | sort | uniq [-c | -d | -u]
egrep (permits regex meta-characters without escaping them)
egrep -wo "[[:alnum:]]+" input_file.txt | sort | uniq [-c | -d | -u]
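Recent GNU grep releases report egrep as obsolescent and point to grep -E as the equivalent spelling. The sketch below assumes GNU grep and uses a small throwaway file (its content is made up for illustration) in place of input_file.txt:

```shell
# Create a small throwaway input (hypothetical content, for illustration only).
tmp=$(mktemp)
printf 'abc line 1\nabc end line\n' > "$tmp"

# -E enables extended regular expressions, so + needs no backslash.
dups=$(grep -E -wo '[[:alnum:]]+' "$tmp" | sort | uniq -d)
printf '%s\n' "$dups"

rm -f "$tmp"
```

This should print abc and line, the two words that occur more than once.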
Explanation
1. First, tokenize the words with grep -wo; each word is printed on its own line.
2. Then sort the tokenized words with sort.
3. Finally, find consecutive unique or duplicate words with uniq.
3.1. uniq -c
This prints the words and their count, covering all matched words, duplicate and unique.
3.2. uniq -d
This prints all duplicate words.
3.3. uniq -u
This prints all unique words.
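The three steps above can be looked at one at a time. This is only a sketch: the text and the variable names (text, tokens, sorted, dups, unsorted_dups) are made up for illustration, and LC_ALL=C is set just to make the sort order predictable. It also shows why the sort step matters: uniq only compares adjacent lines.

```shell
# Sample text held in a variable (hypothetical, for illustration only).
text='abc line 1
abc end line'

# Step 1: tokenize, one word per line.
tokens=$(printf '%s\n' "$text" | grep -wo '[[:alnum:]]\+')

# Step 2: sort so that equal words end up on adjacent lines.
sorted=$(printf '%s\n' "$tokens" | LC_ALL=C sort)

# Step 3: uniq compares adjacent lines only, so it needs sorted input.
dups=$(printf '%s\n' "$sorted" | uniq -d)
printf '%s\n' "$dups"

# Without sorting, the two "abc" lines are not adjacent and nothing is reported.
unsorted_dups=$(printf '%s\n' "$tokens" | uniq -d)
printf 'without sort: [%s]\n' "$unsorted_dups"
```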
Examples
Sample Input
abc line 1
xyz zzz
123 456
abc end line
Example 1 -- duplicate/unique words with their count:
grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -c
Output:
1 1
1 123
1 456
2 abc
1 end
2 line
1 xyz
1 zzz
Example 2 -- duplicate words only:
grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -d
Output:
abc
line
Example 3 -- unique words only:
grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -u
Output:
1
123
456
end
xyz
zzz
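Since the three examples differ only in the uniq flag, they can be folded into one small shell function. This is a convenience sketch, not part of the original answer; the name word_report and its argument order are hypothetical.

```shell
# word_report MODE FILE
# Hypothetical helper: MODE is one of -c, -d or -u and is passed straight to uniq.
word_report() {
    grep -wo '[[:alnum:]]\+' "$2" | sort | uniq "$1"
}

# Demo on the Sample Input, written to a temporary file.
input=$(mktemp)
printf 'abc line 1\nxyz zzz\n123 456\nabc end line\n' > "$input"

word_report -d "$input"   # duplicate words only
word_report -u "$input"   # unique words only

rm -f "$input"
```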
Command Dissection & Sources
- grep parameters
- grep regex expression
  - [[:alnum:]]: Alphanumeric characters.
  - \+: Kleene Plus character. Matches one or more occurrences.
- sort
- uniq
  - -c: Prints words with their repetition count.
  - -d: Prints only repeated (duplicated) lines.
  - -u: Prints only nonrepeated (unique) lines.
om-ha
Aspire to pragmatically achieve high-quality maintainable logic within a strict time constraint.
Updated on September 18, 2022
Comments
-
om-ha over 1 year
Problem
Is it possible to print repeated (non-unique) words that span multiple lines of a file, not just repeated words within a single line?
Previous work
There's this question, which solves the problem of finding duplicate words within the same line. It also has the issue that it matches the ending word boundary with the starting one.
Sample Input
[
  {
    entity: {
      id: int
      employee: {
        id: int
        company: {
          name: string
          area: {
            country: string
            city: string
            zipcode: string
          }
        }
        person: {
          id: int
          firstName: string
          middleName: string
          lastName: string
        }
      }
    }
    entity: {
      id: int
      person: {
        id: int
        firstName: string
        middleName: string
        lastName: string
      }
      area: {
        country: string
        city: string
        zipcode: string
      }
    }
  }
]
Sample Output -- REPEATED/DUPLICATE
area city country entity firstName id int lastName middleName person string zipcode
Sample Output -- UNIQUE
company employee name
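For reference, the grep | sort | uniq pipeline from the Solution section can be run against this sample roughly as follows. It is a sketch under a few assumptions: the sample is laid out one field per line and saved to a temporary file, the variable names (sample, repeated, unique) are made up, and LC_ALL=C is only there to pin the sort order.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
[
  {
    entity: {
      id: int
      employee: {
        id: int
        company: {
          name: string
          area: {
            country: string
            city: string
            zipcode: string
          }
        }
        person: {
          id: int
          firstName: string
          middleName: string
          lastName: string
        }
      }
    }
    entity: {
      id: int
      person: {
        id: int
        firstName: string
        middleName: string
        lastName: string
      }
      area: {
        country: string
        city: string
        zipcode: string
      }
    }
  }
]
EOF

# Words that occur more than once anywhere in the file.
repeated=$(grep -wo '[[:alnum:]]\+' "$sample" | LC_ALL=C sort | uniq -d)
# Words that occur exactly once.
unique=$(grep -wo '[[:alnum:]]\+' "$sample" | LC_ALL=C sort | uniq -u)

# Left unquoted on purpose so the shell joins the words onto one line.
echo "REPEATED:" $repeated
echo "UNIQUE:" $unique

rm -f "$sample"
```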
-
Ed Morton over 4 years
Please edit your question to make it stand-alone, including an MCVE with concise, testable sample input and expected output so we can help you. It's trivial to print unique or duplicated words from a block of multi-line text, but the question you reference was trying to find contiguous repetitions of the same word, which is a harder problem. Is that what this question is about? How to do that when the word is repeated on the next line instead of on the same line?
-
Ed Morton over 4 years
FWIW I just added an answer to the question you reference showing how to remove repeated words from multi-line text.
-
om-ha over 4 years
Edited my answer providing an example input and output.
-
Ed Morton over 4 years
I think you may have misunderstood the word "repeated" in the previous question. They were looking to identify a word like "dog" repeated in a string like "the dog dog barked", not just the word "dog" duplicated anywhere in the input (e.g. "the dog barked at another dog"). Finding duplicate or unique words in a block of text is trivial; it's finding repeated words that's interesting, and removing the repetition that's actually relatively difficult.
-
-
Ed Morton over 4 years
None of that finds just repeated words as in the the other other question that you reference in your question.
-
om-ha over 4 years
I'm talking about finding a repeated keyword within the entire file, regardless of whether the repetitions are consecutive. Edited my question and self-answer.