Find Duplicate/Repeated or Unique words spanning across multiple lines in a file


Solution

grep

grep -wo "[[:alnum:]]\+" input_file.txt | sort | uniq [-c | -d | -u]

egrep (same as grep -E: extended regex, so + needs no backslash; recent GNU grep versions mark egrep as obsolescent)

egrep -wo "[[:alnum:]]+" input_file.txt | sort | uniq [-c | -d | -u]
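As a quick check that the two spellings above tokenize identically, here is a minimal sketch assuming GNU grep. It uses grep -E in place of egrep, and the variable names (sample, bre, ere) are only illustrative.

```shell
# Sample text; any input works here.
sample='abc line 1
xyz zzz
123 456
abc end line'

# Basic regex: the plus sign must be escaped (a GNU grep extension).
bre=$(printf '%s\n' "$sample" | grep -wo '[[:alnum:]]\+')

# Extended regex via grep -E: the plus sign is written bare.
ere=$(printf '%s\n' "$sample" | grep -Ewo '[[:alnum:]]+')

if [ "$bre" = "$ere" ]; then
    echo "same tokens"
else
    echo "tokens differ"
fi
```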

Explanation

  1. First, tokenize the words with grep -wo; each word is printed on its own line.

  2. Then sort the tokenized words with sort, so that identical words end up on adjacent lines.

  3. Finally, find unique or duplicate words with uniq, which only compares adjacent lines (hence the sort).

    3.1. uniq -c This prints every matched word, duplicate or unique, prefixed with its count.

    3.2. uniq -d This prints all duplicate words.

    3.3. uniq -u This prints all unique words.
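The sort step matters because uniq only compares neighbouring lines. A minimal sketch of the difference, assuming GNU grep; the function names dup_words and dup_words_unsorted are hypothetical helpers, not part of any tool:

```shell
# Hypothetical helper: list words that occur more than once on stdin.
# uniq only collapses adjacent identical lines, so sort must come first.
dup_words() {
    grep -wo '[[:alnum:]]\+' | LC_ALL=C sort | uniq -d
}

# Same pipeline without sort, to show what goes wrong.
dup_words_unsorted() {
    grep -wo '[[:alnum:]]\+' | uniq -d
}

echo "with sort:"
printf 'abc line 1\nxyz zzz\n123 456\nabc end line\n' | dup_words

echo "without sort:"
printf 'abc line 1\nxyz zzz\n123 456\nabc end line\n' | dup_words_unsorted
```

With sort, the duplicates abc and line are reported. Without it, nothing is printed for this input, because no two identical words ever sit on adjacent lines.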

Examples

Sample Input

abc line 1
xyz zzz
123 456
abc end line

Example 1 -- duplicate/unique words with their count:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -c

Output:

   1 1
   1 123
   1 456
   2 abc
   1 end
   2 line
   1 xyz
   1 zzz

Example 2 -- duplicate words only:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -d

Output:

abc
line

Example 3 -- unique words only:

grep -wo '[[:alnum:]]\+' input_file.txt | sort | uniq -u

Output:

1
123
456
end
xyz
zzz
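One way to sanity-check the three examples against each other: every distinct word is either duplicated or unique, so the -d and -u counts should add up to the number of distinct words. A rough sketch, assuming GNU grep and the sample input above; the helper name words and the temporary file are illustrative only.

```shell
# Sample input from above, stored in a temporary file (path is arbitrary).
input=$(mktemp)
printf 'abc line 1\nxyz zzz\n123 456\nabc end line\n' > "$input"

# Hypothetical helper: tokenize and sort, then apply the given uniq flag.
words() {
    grep -wo '[[:alnum:]]\+' "$input" | LC_ALL=C sort | uniq "$@"
}

total=$(words | wc -l)        # distinct words
dups=$(words -d | wc -l)      # words seen more than once
uniques=$(words -u | wc -l)   # words seen exactly once

# $((...)) also strips any padding that wc -l may add.
echo "distinct=$((total)) duplicated=$((dups)) unique=$((uniques))"

rm -f "$input"
```

For the sample input this prints distinct=8 duplicated=2 unique=6, matching the eight, two and six lines shown in Examples 1, 2 and 3.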

Command Dissection & Sources

  • grep parameters
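    • -w selects only matches that form whole words, i.e. matches delimited by word boundaries (non-word characters \W, or the start/end of the line)
    • -o prints only the matched non-empty parts of a matching line, each on its own output line -- i.e. in our case it prints just the matched words instead of the whole line
  • grep regex expression
    • [[:alnum:]] Alphanumeric characters
    • \+ Kleene plus. Matches one or more occurrences of the preceding item.
  • sort Sorts the lines so that identical words become adjacent, which uniq requires.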
  • uniq
    • -c Prints words with their repetition count.
    • -d Prints only repeated (duplicated) lines.
    • -u Prints only nonrepeated (unique) lines.
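
To make the effect of -o concrete, here is a small sketch (assuming GNU grep; the variable names are illustrative) that runs the same pattern on one line with and without the flag:

```shell
line='abc line 1'

# Without -o, grep prints the whole matching line once.
whole=$(printf '%s\n' "$line" | grep -w '[[:alnum:]]\+')

# With -o, each match is printed on its own line.
tokens=$(printf '%s\n' "$line" | grep -wo '[[:alnum:]]\+')

printf 'without -o:\n%s\n' "$whole"
printf 'with -o:\n%s\n' "$tokens"
```

Only the -o form yields one word per line, which is what sort and uniq need downstream.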

Author by om-ha

Aspire to pragmatically achieve high-quality maintainable logic within a strict time constraint.

Updated on September 18, 2022

Comments

  • om-ha
    om-ha over 1 year

    Problem

    Is it possible to print words that are repeated anywhere in a file, across multiple lines, rather than only words repeated within a single line?

    Previous work

    There's this question which solves the problem of finding duplicate words within the same line. It also has an issue that it matches the ending word boundary with the starting one.

    Sample Input

    [
        {
            entity: 
            {
                id: int
                employee:
                {
                    id: int
                    company: {
                        name: string
                        area: 
                        {
                            country: string
                            city: string
                            zipcode: string
                        }
                    }
                    person: 
                    {
                        id: int
                        firstName: string
                        middleName: string
                        lastName: string
                    }
                }
            }
            entity: 
            {
                id: int
                person: 
                {
                    id: int
                    firstName: string
                    middleName: string
                    lastName: string
                }
                area: 
                {
                    country: string
                    city: string
                    zipcode: string
                }
            }
        }
    ]
    

    Sample Output -- REPEATED/DUPLICATE

    area
    city
    country
    entity
    firstName
    id
    int
    lastName
    middleName
    person
    string
    zipcode
    

    Sample Output -- UNIQUE

    company
    employee
    name
    
    • Ed Morton
      Ed Morton over 4 years
      Please edit your question to make it stand-alone, including an MCVE with concise, testable sample input and expected output so we can help you. It's trivial to print unique or duplicated words from a block of multi-line text, but the question you reference was trying to find contiguous repetitions of the same word, which is a harder problem. Is that what this question is about? How to do that when the word is repeated on the next line instead of on the same line?
    • Ed Morton
      Ed Morton over 4 years
      FWIW I just added an answer to the question you reference showing how to remove repeated words from multi-line text.
    • om-ha
      om-ha over 4 years
      Edited my answer providing an example input and output.
    • Ed Morton
      Ed Morton over 4 years
      I think you may have misunderstood the word "repeated" in the previous question. They were looking to identify a word like dog repeated in a string like the dog dog barked, not just the word dog duplicated anywhere in the input (e.g. the dog barked at another dog). Finding duplicate or unique words in a block of text is trivial; it's finding repeated words that's interesting, and removing the repetition that's actually relatively difficult.
  • Ed Morton
    Ed Morton over 4 years
    None of that finds just repeated words as in the the other other question that you reference in your question.
  • om-ha
    om-ha over 4 years
    I'm talking about finding a repeated keyword within the entire file regardless if consecutive or not. Edited my question and self-answer.