How to get the unique count of a particular part of a string


Solution 1

With grep, extract just the numbers:

grep -Eo '[0-9]+-' file | sort -u | wc -l
  • [0-9] Matches any character between 0 and 9 (any digit).
  • + in extended regular expressions means one or more occurrences of the preceding item (that's why the -E option is used with grep). So [0-9]+- matches one or more digits, followed by -.
  • -o only prints the part that matched your pattern, so given input abcd23-gf56, grep will only print 23-.
  • sort -u sorts and filters unique entries (due to -u), and wc -l counts the number of lines in input (hence, the number of unique entries).
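A minimal sketch of the pipeline above, run on a few lines in the style of the question's data. The function name count_unique_prefix_numbers is made up for illustration, and it reads standard input rather than a file. Note that a digit run only matches when it is immediately followed by a dash, so the trailing 8768 below is ignored.

```shell
# Hypothetical helper: count distinct "digits followed by a dash" matches on stdin.
count_unique_prefix_numbers() {
    grep -Eo '[0-9]+-' | sort -u | wc -l
}

# Matches are 7433-, 7433-, 7333- and 7335-, so three distinct values are counted.
printf '%s\n' psf7433-nlhrms unit7433-nobody unit7333-opera sun7335-ukhrms8768 |
    count_unique_prefix_numbers
```

If a line can contain a second digit run that is itself followed by a dash, that run would be matched as well, so this sketch assumes a single dash per line.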

Solution 2

Use extended grep to look for one or more digits, telling grep to list only the matches (as opposed to the whole line, which is the default):

grep -Eo '[0-9]+' <filename>

Sort this list of numbers and only output unique ones:

sort -u

Count the number of lines:

wc -l

Put it all together:

$ grep -Eo '[0-9]+' filename | sort -u | wc -l
8
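One caveat, sketched here under the assumption that a line may carry digits after the dash (for example sun7335-ukhrms8768): the pattern [0-9]+ matches every digit run on the line, so those later digits are counted too. The second function below is a hypothetical variant, not part of the answer, that uses cut to keep only the text before the first dash. Both function names are invented for illustration.

```shell
# The pattern as written: every digit run on a line is a separate match.
count_all_digit_runs() {
    grep -Eo '[0-9]+' | sort -u | wc -l
}

# Hypothetical variant: drop everything from the first '-' onward, then match.
count_digit_runs_before_dash() {
    cut -d- -f1 | grep -Eo '[0-9]+' | sort -u | wc -l
}

# 7335 and 8768 are both counted here.
printf '%s\n' sun7335-ukhrms8768 sun7335-other | count_all_digit_runs
# Only 7335 is counted here.
printf '%s\n' sun7335-ukhrms8768 sun7335-other | count_digit_runs_before_dash
```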

Solution 3

You can use:

tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .

...which is, admittedly, more than a little inspired by muru's answer here. The difference is that I use grep rather than wc to count the lines, in case there are blank lines in the input. muru's answer doesn't have a blank-line problem, because grep -o prints only the parts of lines that match (whereas grep -c here only counts matching lines), but tr does print blank lines because the newline is one of the few characters it does not delete. Once sort -u has collapsed them into one, any number of blank lines in the input would skew wc's result by one.
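The off-by-one can be seen in a small sketch with two blank lines in the input. The function names count_with_wc and count_with_grep are hypothetical and exist only to put the two counting steps side by side.

```shell
# Same filter and sort, counted with wc: the surviving blank line is counted.
count_with_wc() {
    tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | wc -l
}

# Same filter and sort, counted with grep -c .: blank lines are skipped.
count_with_grep() {
    tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .
}

# Two data lines separated by two blank lines.
printf 'a1-x\n\n\nb2-y\n' | count_with_wc     # one more than the number of distinct values
printf 'a1-x\n\n\nb2-y\n' | count_with_grep   # the number of distinct values
```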

tr is probably more efficient here than grep -o, though wc likely beats grep at counting. I prefer it this way partly for portability, and partly because I usually try to prune data with the most efficient filter first and to use the less efficient ones later in the chain.

tr deletes (-d) the complement (-c) of the set of digits, dashes, and newlines, so only those bytes survive. That lets sort pick the part of each line to consider: it does a unique (-u), numeric (-n) sort on the first key (-k1,1), splitting fields on the dash delimiter (-t-). So long as no dash occurs before the numeric string you wish to compare, the only thing remaining on any line is:

#nothing at all

...or...

[numbers]

...or...

[numbers]-[more numbers]more-dashes-...

So when the output is piped to sort, we instruct it to compare only the numeric string occurring before the first dash, if any. That way, dashes or not, the only numbers that matter are the ones you want to count.
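The two stages can be inspected separately. This is a small sketch using three lines from the test input further down; the function name strip_to_digits_and_dashes is invented for illustration.

```shell
# Hypothetical helper: keep only dashes, digits and newlines.
strip_to_digits_and_dashes() {
    tr -dc '\-0-9\n'
}

# Stage 1: the three lines are reduced to 7333--09-0, 7333- and 7330-09-0.
printf '%s\n' unit7333-pro-09-0jm unit7333-opera mnp7330-redir09o-0ect |
    strip_to_digits_and_dashes

# Stage 2: sort compares only the first dash-separated field numerically,
# so the two 7333 lines collapse into one and two lines remain.
printf '%s\n' unit7333-pro-09-0jm unit7333-opera mnp7330-redir09o-0ect |
    strip_to_digits_and_dashes | sort -u -t- -nk1,1
```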

Finally, grep -c counts the lines that contain at least one character (the pattern . matches any single character), so blank lines are skipped. The following command prints 8:

tr -dc '\-0-9\n' <<\IN | sort -u -t- -nk1,1 | grep -c .
psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-pro-09-0jm
mnp7330-redir09o-0ect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support


unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply
IN

Author: Karthik

Updated on September 18, 2022

Comments

  • Karthik
    Karthik over 1 year

    I have a set of data in a file.

    psf7433-nlhrms
    unit7433-nobody
    unit7333-opera
    bpx7333-operations
    app7333-osm
    unit7330-partners
    psf7331-pdesmond
    unit7333-projm
    mnp7330-redirect
    unit7333-retailbanking
    cpq7333-rkarmer
    unit6333-sales
    ring7323-support
    unit7133-telco
    post7323-uadb
    sun7335-ukhrms
    burp7133-wfnmreply
    

    How can I ignore the leading alphabetic characters in each line and the characters after the number, and get the count of the unique numbers? In other words, how can I retrieve only the numeric value in each line and get the unique count?

    Considering we manage to extract only the numeric values, we will get this.

    7433
    7433
    7333
    7333
    7333
    7330
    7331
    7333
    7330
    7333
    7333
    6333
    7323
    7133
    7323
    7335
    7133
    

    Now, I want the unique count of the retrieved numeric values. So ignoring the repetitions, I should get the following final output.

    8
    

    I am unable to do this with awk, sed, or even a simple grep | cut.

    I do not want the list of extracted values, I want only the final count as the answer.

    Help me!

  • Karthik
    Karthik over 9 years
    Hi, I just modified the question with a new set of input. Will the code be applicable for the new input too.
  • Gilles Quenot
    Gilles Quenot over 9 years
    POST edited accordingly
  • Karthik
    Karthik over 9 years
    i'm new to shell scripting so can you explain what this script does in short. thanks a ton!
  • Karthik
    Karthik over 9 years
    gives me the answer 0
  • Gilles Quenot
    Gilles Quenot over 9 years
    You have to modify file by your own file Karthik
  • Karthik
    Karthik over 9 years
    works after the edit! just a doubt...will this piece of script work even if one of the file entries contains a line like sun73-ukhrms. where the numeric value is just 2 digits. Moreover, hope this script does not read any numeric value after the '-' character in the line.
  • Karthik
    Karthik over 9 years
    as per my requirement, the script should ignore any numeric value after the '-' character in the input (if any). For eg: if the input is sun7335-ukhrms8768, it should not read 8768 as it falls after the '-' character.
  • Gilles Quenot
    Gilles Quenot over 9 years
    New requirement satisfied chief
  • muru
    muru over 9 years
    @Karthik Does the - appear only once per line? And does it always appear after the first number?
  • Karthik
    Karthik over 9 years
    yes only once .
  • muru
    muru over 9 years
    @Karthik check the update.
  • muru
    muru over 9 years
    @mikeserv presence of multiple numbers might be a problem. See the second comment on this answer.
  • mikeserv
    mikeserv over 9 years
    good point. missed that. plus echo was dumb - but if the last line doesn't end in a newline your counts will be off by one.
  • muru
    muru over 9 years
    @mikeserv I suppose a {grep ...; echo} should fix that?
  • Karthik
    Karthik over 9 years
    thanks but will this script also read any numeric (just in case) which occurs after the '-' character.
  • Karthik
    Karthik over 9 years
    I provided only 4 digit numerics in the sample data. My original data contains entries with 2,3 digit numeric values too. The only condition is it should not count any numeric values after the '-' character in the entry.
  • garethTheRed
    garethTheRed over 9 years
    @Karthik - I've changed the grep search pattern to look for at least one digit.
  • Karthik
    Karthik over 9 years
    @muru grep -Eo '[0-9]+-' file | sort -u | wc -l gives me an approximate count when applied on my original input file. Thanks. Now, I need a small favor. Let me pull out 5 entries from my input. unit7433-nobody unit7333-opera unit7333-retailbanking unit7133-telco unit6333-sales. This type 'unit' contains 4 unique numbers in 5 entries. Another case is psf7433-nlhrms psf7331-pdesmond where the type 'psf' contains 2 unique numbers in its 2 entries. How to pull out the count of unique numbers from each type in the input? Should we use a 'if' clause or something else?
  • Karthik
    Karthik over 9 years
    I need a small favor. Let me pull out 5 entries from my input. unit7433-nobody unit7333-opera unit7333-retailbanking unit7133-telco unit6333-sales. This type 'unit' contains 4 unique numbers in 5 entries. Another case is psf7433-nlhrms psf7331-pdesmond where the type 'psf' contains 2 unique numbers in its 2 entries. How to pull out the count of unique numbers from each type in the input? Should we use a 'if' clause or something else?
  • muru
    muru over 9 years
    @Karthik you should consider using something like awk for that. Try: sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}'
  • Karthik
    Karthik over 9 years
    to run this script for my input file all-dss-accounts.txt, i suppose i should use this sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}' all-dss-accounts.txt
  • muru
    muru over 9 years
    @Karthik other way around sed ... all-dss-accounts | awk ...
  • Karthik
    Karthik over 9 years
    gives me this error awk: {a[$1][$2]++}END{for (i in a) {print i,length(a[i])}} awk: ^ syntax error
  • muru
    muru over 9 years
    @Karthik looks like I used a GNU awk feature. Any chance you can get GNU awk?
  • Karthik
    Karthik over 9 years
    @muru i donno how to explain it but i'm working on a unix server using putty. i'm running my own business so i use the client's unix machine to get the input. i dont think i can use GNU.
  • muru
    muru over 9 years
    @Karthik replace the awk expression with this: '{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'
  • Karthik
    Karthik over 9 years
    sorry for the trouble @muru but i get ./test1.sh: line 14: {a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}: command not found as the result :(
  • muru
    muru over 9 years
    @Karthik that's an awk expression. Use it like this: sed ... | awk '{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'.
  • Karthik
    Karthik over 9 years
    it gives the result @muru but i am actually expecting an output like this link. Here the 'count' column gives the total number of entries of the type. I want the 'unique numeric' column to show all the unique numeric values for the particular type (fap, unit, psf etc...)
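The per-type count discussed in the comments above can be sketched as follows, based on muru's non-GNU awk expression. The function name count_unique_per_type, the sed -E flag (in place of -r), the NF guard, and the trailing sort are assumptions added for illustration. It prints each alphabetic prefix with its number of distinct numeric values, not the list of values asked for in the last comment.

```shell
# Hypothetical helper: for each alphabetic prefix, count the distinct numbers
# that appear between the prefix and the first dash.
count_unique_per_type() {
    sed -E 's/([a-z]*)([0-9]*)-.*/\1 \2/' |
    awk 'NF == 2 { seen[$1, $2]++ }   # one entry per distinct (prefix, number) pair
         END {
             for (key in seen) {
                 split(key, part, SUBSEP)   # part[1] is the prefix
                 per_prefix[part[1]]++
             }
             for (prefix in per_prefix) print prefix, per_prefix[prefix]
         }' |
    sort   # awk's for-in order is unspecified, so sort for stable output
}

# The entries quoted in the comments: unit has 4 distinct numbers, psf has 2.
printf '%s\n' unit7433-nobody unit7333-opera unit7333-retailbanking \
    unit7133-telco unit6333-sales psf7433-nlhrms psf7331-pdesmond |
    count_unique_per_type
```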