Find Duplicate/Repeated or Unique words spanning across multiple lines in a file

bash text-processing grep regular-expression

5,621

Solution

grep

grep -wo "[[:alnum:]]\+" input_file.txt | sort | uniq [-c | -d | -u]

egrep (same as grep -E: extended regex syntax, so meta-characters such as + need no escaping; egrep itself is deprecated in favor of grep -E)

egrep -wo "[[:alnum:]]+" input_file.txt | sort | uniq [-c | -d | -u]
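
The two commands above should differ only in regex dialect. A minimal sketch to check that, assuming GNU grep; it uses `grep -E` in place of `egrep`, and the `input`, `bre` and `ere` variables are illustrative names, not part of the original answer:

```shell
# Two lines of illustrative input.
input='abc line 1
xyz zzz'

# Basic regex: the plus must be escaped (\+ is a GNU extension in BRE).
bre=$(printf '%s\n' "$input" | grep -wo '[[:alnum:]]\+')

# Extended regex: the plus is a meta-character as-is.
ere=$(printf '%s\n' "$input" | grep -E -wo '[[:alnum:]]+')

if [ "$bre" = "$ere" ]; then
    echo "both dialects produce the same tokens"
fi
```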

Explanation

1. First, tokenize the words with grep -wo; each word is printed on its own line.
2. Then sort the tokenized words with sort, so that identical words end up on adjacent lines.
3. Finally, find unique or duplicate words with uniq, which only compares consecutive lines.

3.1. uniq -c This prints every matched word, duplicate or unique, together with its count.

3.2. uniq -d This prints each duplicated word once.

3.3. uniq -u This prints only the words that occur exactly once.
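
The sort step matters because uniq only compares neighbouring lines. A small sketch of the difference, using a hypothetical word list in a `words` variable rather than a file:

```shell
# Tokenized words in their original (unsorted) order.
words='abc
line
xyz
abc
line'

# Without sort: no two adjacent lines are equal, so uniq -d reports nothing.
unsorted_dups=$(printf '%s\n' "$words" | uniq -d)

# With sort: equal words become adjacent, so uniq -d reports each one once.
sorted_dups=$(printf '%s\n' "$words" | sort | uniq -d)

printf 'without sort: %s\n' "${unsorted_dups:-<none>}"
printf 'with sort:\n%s\n' "$sorted_dups"
```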

Examples

Sample Input

abc line 1
xyz zzz
123 456
abc end line

Example 1 -- duplicate/unique words with their count:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -c

Output:

   1 1
   1 123
   1 456
   2 abc
   1 end
   2 line
   1 xyz
   1 zzz

Example 2 -- duplicate words only:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -d

Output:

abc
line

Example 3 -- unique words only:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -u

Output:

1
123
456
end
xyz
zzz

Command Dissection & Sources

grep parameters
- -w matches whole words only: a match must be bounded by non-word characters (\W) or by the start or end of the line
- -o prints only the matched, non-empty parts of each matching line, one match per output line -- i.e. in our case each matched word on its own line
grep regular expression
- [[:alnum:]] Alphanumeric characters
- \+ Kleene plus. Matches one or more occurrences of the preceding item (a GNU extension in basic regular expressions).
sort
uniq
- -c Prints words with their repetition count.
- -d Prints only repeated (duplicated) lines.
- -u Prints only nonrepeated (unique) lines.
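
One consequence of combining -w with [[:alnum:]] is worth seeing in isolation: the underscore counts as a word character, so alphanumeric runs that touch an underscore are not whole words and get dropped. A sketch assuming GNU grep, with a made-up input line:

```shell
# Illustrative line containing an underscore-joined identifier.
line='abc foo_bar, end-line'

# -o alone: every alphanumeric run is printed, including both halves of foo_bar.
without_w=$(printf '%s\n' "$line" | grep -o '[[:alnum:]]\+')

# -w added: foo and bar border an underscore (a word character), so they are skipped.
with_w=$(printf '%s\n' "$line" | grep -wo '[[:alnum:]]\+')

printf 'grep -o:\n%s\n\ngrep -wo:\n%s\n' "$without_w" "$with_w"
```

If identifiers such as foo_bar should count as single words, one option is to widen the bracket expression to [[:alnum:]_].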


om-ha


Updated on September 18, 2022

Comments

om-ha over 1 year
Problem

Is it possible to print repeated (non-unique) words across multiple lines of a file, and not just words repeated within a single line?

Previous work

There's this question, which solves the problem of finding duplicate words within the same line. Its solution also has the issue that it matches the ending word boundary with the starting one.

Sample Input
```
[
    {
        entity: 
        {
            id: int
            employee:
            {
                id: int
                company: {
                    name: string
                    area: 
                    {
                        country: string
                        city: string
                        zipcode: string
                    }
                }
                person: 
                {
                    id: int
                    firstName: string
                    middleName: string
                    lastName: string
                }
            }
        }
        entity: 
        {
            id: int
            person: 
            {
                id: int
                firstName: string
                middleName: string
                lastName: string
            }
            area: 
            {
                country: string
                city: string
                zipcode: string
            }
        }
    }
]
```
Sample Output -- REPEATED/DUPLICATE
```
area
city
country
entity
firstName
id
int
lastName
middleName
person
string
zipcode
```
Sample Output -- UNIQUE
```
company
employee
name
```
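
As a check, the grep | sort | uniq pipeline can be run against the sample input above. This is a sketch assuming GNU grep; the temporary file and the `sample`, `sorted_words`, `duplicates` and `uniques` names are illustrative, and `LC_ALL=C` is set only to make the sort order predictable:

```shell
# Recreate the sample input in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[
    {
        entity:
        {
            id: int
            employee:
            {
                id: int
                company: {
                    name: string
                    area:
                    {
                        country: string
                        city: string
                        zipcode: string
                    }
                }
                person:
                {
                    id: int
                    firstName: string
                    middleName: string
                    lastName: string
                }
            }
        }
        entity:
        {
            id: int
            person:
            {
                id: int
                firstName: string
                middleName: string
                lastName: string
            }
            area:
            {
                country: string
                city: string
                zipcode: string
            }
        }
    }
]
EOF

# Tokenize and sort once, then reuse the sorted list for both reports.
sorted_words=$(grep -wo '[[:alnum:]]\+' "$sample" | LC_ALL=C sort)
duplicates=$(printf '%s\n' "$sorted_words" | uniq -d)
uniques=$(printf '%s\n' "$sorted_words" | uniq -u)

printf 'REPEATED/DUPLICATE\n%s\n\nUNIQUE\n%s\n' "$duplicates" "$uniques"
rm -f "$sample"
```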
- Ed Morton over 4 years
  
  Please edit your question to make it stand-alone, including a MCVE with concise, testable sample input and expected output so we can help you. It's trivial to print unique or duplicated words from a block of multi-line text, but the question you reference was trying to find contiguous repetitions of the same word, which is a harder problem. Is that what this question is about? How to do that when the word is repeated on the next line instead of on the same line?
- Ed Morton over 4 years
  
  FWIW I just added an answer to the question you reference showing how to remove repeated words from multi-line text.
- om-ha over 4 years
  
  Edited my answer providing an example input and output.
- Ed Morton over 4 years
  
  I think you may have misunderstood the word "repeated" in the previous question. They were looking to identify a word like "dog" repeated in a string like "the dog dog barked", not just the word "dog" duplicated anywhere in the input (e.g. "the dog barked at another dog"). Finding duplicate or unique words in a block of text is trivial; it's finding repeated words that's interesting, and removing the repetition that's actually relatively difficult.
Ed Morton over 4 years

None of that finds just repeated words as in the the other other question that you reference in your question.
om-ha over 4 years

I'm talking about finding a repeated keyword within the entire file regardless if consecutive or not. Edited my question and self-answer.
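
The distinction drawn in the comments, between words duplicated anywhere in the input and words repeated back to back, can be sketched with the same tokenizer. This is only an illustration under assumptions, not the approach from the referenced question: it assumes GNU grep and any POSIX awk, and the `text`, `repeated` and `duplicated` names are made up for the example.

```shell
# Illustrative input: "dog dog" repeats within a line, "another another" across a line break.
text='the dog dog barked at another
another dog'

# Back-to-back repeats: print a word when it equals the previous token.
repeated=$(printf '%s\n' "$text" \
    | grep -wo '[[:alnum:]]\+' \
    | awk '$0 == prev { print } { prev = $0 }')

# Duplicates anywhere in the input, consecutive or not.
duplicated=$(printf '%s\n' "$text" \
    | grep -wo '[[:alnum:]]\+' \
    | LC_ALL=C sort | uniq -d)

printf 'repeated back to back:\n%s\n\nduplicated anywhere:\n%s\n' "$repeated" "$duplicated"
```

A word that occurs three times in a row is printed twice by the awk step; piping the result through sort -u would collapse that.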