How to get the unique count of a particular part of a string

shell text-processing scripting grep sort

9,273

Solution 1

With grep, extract just the numbers:

grep -Eo '[0-9]+-' file | sort -u | wc -l

[0-9] matches any single character between 0 and 9 (any digit).
+ in extended regular expressions means one or more occurrences of the preceding item (that's why the -E option is used with grep). So [0-9]+- matches one or more digits, followed by -.
-o only prints the part that matched your pattern, so given input abcd23-gf56, grep will only print 23-.
sort -u sorts and filters unique entries (due to -u), and wc -l counts the number of lines in input (hence, the number of unique entries).
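
As a rough illustration of the steps above, here is a small sketch that runs the same pipeline on a few lines from the question's data. The function name count_unique_ids and the temporary file are only there for the demonstration and are not part of the original answer.

```shell
#!/bin/sh
# Sketch: count the distinct numbers that sit directly before a '-'.
# count_unique_ids is a hypothetical helper name used for illustration.
count_unique_ids() {
    # -E: extended regex, -o: print only the matched text (e.g. "7433-")
    grep -Eo '[0-9]+-' "$1" | sort -u | wc -l
}

# Small sample taken from the question's data.
sample=$(mktemp)
printf '%s\n' psf7433-nlhrms unit7433-nobody unit7333-opera \
    bpx7333-operations unit7330-partners > "$sample"

grep -Eo '[0-9]+-' "$sample"   # one match per line: 7433- 7433- 7333- 7333- 7330-
count_unique_ids "$sample"     # 3 distinct values: 7330-, 7333-, 7433-
rm -f "$sample"
```

The trailing - stays in each match, which is harmless here because every match carries it and only the number of distinct lines matters.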

Solution 2

Use extended grep to look for one or more digits, telling grep to list only the matches (as opposed to the whole line, which is the default):

grep -Eo '[0-9]+' <filename>

Sort this list of numbers and only output unique ones:

sort -u

Count the number of lines:

wc -l

Put it all together:

$ grep -Eo '[0-9]+' filename | sort -u | wc -l
8
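
Note that [0-9]+ matches every run of digits on a line, including any that follow the -. If the data can contain digits after the dash (as raised in the comments), one possible variant is to cut each line at its first - before extracting. This is only a sketch under the assumption that each line has a single run of digits before its first dash; count_before_dash is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count distinct numbers that appear before the first '-'.
# Assumes each line holds a single run of digits before its first dash.
count_before_dash() {
    # sed removes everything from the first '-' to the end of the line,
    # so digits after the dash never reach grep.
    sed 's/-.*//' "$1" | grep -Eo '[0-9]+' | sort -u | wc -l
}

demo=$(mktemp)
printf '%s\n' sun7335-ukhrms8768 unit7335-x psf7331-y09 > "$demo"

grep -Eo '[0-9]+' "$demo" | sort -u | wc -l   # 4: also counts 8768 and 09
count_before_dash "$demo"                      # 2: only 7331 and 7335
rm -f "$demo"
```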

Solution 3

You can use:

tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .

...which is, admittedly, more than a little inspired by muru's answer here. The difference is that I use grep rather than wc to count the lines, in case the input contains blank lines. His answer has no blank-line problem, because grep -o prints only the matches themselves, whereas tr does print blank lines: the newline is one of the few characters it does not delete. sort -u collapses those blank lines into a single one, so any number of blank lines in the input would skew wc's result by one.

So tr here is probably more efficient than grep -o, though wc likely beats grep at counting. I prefer it this way partly for portability, and also because I usually try to prune data with the most efficient filter first and leave the less efficient ones for later in the chain.

This lets sort pick the part of each line it considers for its unique (-u) sort: a numeric (-n) key (-k1,1), with fields split on the - delimiter (-t-). tr deletes (-d) the complement (-c) of the set of digits, dashes, and newlines, that is, every other byte in its input. So as long as no dash occurs before the numeric string you wish to compare, the only things that can remain on a line are:

#nothing at all

...or...

[numbers]

...or...

[numbers]-[more numbers]more-dashes-...

So when the output is piped to sort, we instruct it to compare only the numeric string occurring before the first dash, if any. That way, dashes or not, the only numbers that matter are the ones you want to count.

Then grep -c . counts the lines containing at least one character. The following command prints 8:

tr -dc '\-0-9\n' <<\IN | sort -u -t- -nk1,1 | grep -c .
psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-pro-09-0jm
mnp7330-redir09o-0ect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support


unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply
IN
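
To see what sort actually receives, it can help to run the tr stage on its own. The sketch below does that for three representative lines and then runs the whole pipeline on them; count_leading_numbers is a hypothetical wrapper name, and the behaviour of \- inside the tr set is assumed to match the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline from this answer; reads stdin.
count_leading_numbers() {
    tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .
}

# Stage 1 alone: what tr leaves of three representative lines.
printf '%s\n' 'unit7333-pro-09-0jm' '' 'burp7133-wfnmreply' |
    tr -dc '\-0-9\n'
# 7333--09-0
# (empty line)
# 7133-

# Full pipeline on the same lines: the blank line is not counted.
printf '%s\n' 'unit7333-pro-09-0jm' '' 'burp7133-wfnmreply' |
    count_leading_numbers
# 2
```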

Karthik

Updated on September 18, 2022

Comments

Karthik over 1 year
I have a set of data in a file.
```
psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-projm
mnp7330-redirect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support
unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply
```
How can I ignore the leading alphabetic characters on each line and everything after the number, and get the count of the unique numbers? Put another way, how can I retrieve only the numeric value on each line and get the unique count?

Assuming we manage to extract only the numeric values, we get this:
```
7433
7433
7333
7333
7333
7330
7331
7333
7330
7333
7333
6333
7323
7133
7323
7335
7133
```
Now, I want the unique count of the retrieved numeric values. So ignoring the repetitions, I should get the following final output.
```
8
```
I am unable to do this either by using awk or sed or even simple grep | cut

I do not want the list of extracted values, I want only the final count as the answer.

Help me!
Karthik over 9 years

Hi, I just modified the question with a new set of input. Will the code be applicable for the new input too.
Gilles Quenot over 9 years

POST edited accordingly
Karthik over 9 years

i'm new to shell scripting so can you explain what this script does in short. thanks a ton!
Karthik over 9 years

gives me the answer 0
Gilles Quenot over 9 years

You have to modify file by your own file Karthik
Karthik over 9 years

works after the edit! just a doubt...will this piece of script work even if one of the file entries contains a line like sun73-ukhrms. where the numeric value is just 2 digits. Moreover, hope this script does not read any numeric value after the '-' character in the line.
Karthik over 9 years

as per my requirement, the script should ignore any numeric value after the '-' character in the input (if any). For eg: if the input is sun7335-ukhrms8768, it should not read 8768 as it falls after the '-' character.
Gilles Quenot over 9 years

New requirement satisfied chief
muru over 9 years

@Karthik Does the - appear only once per line? And does it always appear after the first number?
Karthik over 9 years

yes only once .
muru over 9 years

@Karthik check the update.
muru over 9 years

@mikeserv presence of multiple numbers might be a problem. See the second comment on this answer.
mikeserv over 9 years

good point. missed that. plus echo was dumb - but if the last line doesn't end in a newline your counts will be off by one.
muru over 9 years

@mikeserv I suppose a {grep ...; echo} should fix that?
Karthik over 9 years

thanks but will this script also read any numeric (just in case) which occurs after the '-' character.
Karthik over 9 years

I provided only 4 digit numerics in the sample data. My original data contains entries with 2,3 digit numeric values too. The only condition is it should not count any numeric values after the '-' character in the entry.
garethTheRed over 9 years

@Karthik - I've changed the grep search pattern to look for at least one digit.
Karthik over 9 years

@muru grep -Eo '[0-9]+-' file | sort -u | wc -l gives me an approximate count when applied on my original input file. Thanks. Now, I need a small favor. Let me pull out 5 entries from my input. unit7433-nobody unit7333-opera unit7333-retailbanking unit7133-telco unit6333-sales. This type 'unit' contains 4 unique numbers in 5 entries. Another case is psf7433-nlhrms psf7331-pdesmond where the type 'psf' contains 2 unique numbers in its 2 entries. How to pull out the count of unique numbers from each type in the input? Should we use a 'if' clause or something else?
Karthik over 9 years

I need a small favor. Let me pull out 5 entries from my input. unit7433-nobody unit7333-opera unit7333-retailbanking unit7133-telco unit6333-sales. This type 'unit' contains 4 unique numbers in 5 entries. Another case is psf7433-nlhrms psf7331-pdesmond where the type 'psf' contains 2 unique numbers in its 2 entries. How to pull out the count of unique numbers from each type in the input? Should we use a 'if' clause or something else?
muru over 9 years

@Karthik you should consider using something like awk for that. Try: sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}'
Karthik over 9 years

to run this script for my input file all-dss-accounts.txt, i suppose i should use this sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}' all-dss-accounts.txt
muru over 9 years

@Karthik other way around sed ... all-dss-accounts | awk ...
Karthik over 9 years

gives me this error awk: {a[$1][$2]++}END{for (i in a) {print i,length(a[i])}} awk: ^ syntax error
muru over 9 years

@Karthik looks like I used a GNU awk feature. Any chance you can get GNU awk?
Karthik over 9 years

@muru i donno how to explain it but i'm working on a unix server using putty. i'm running my own business so i use the client's unix machine to get the input. i dont think i can use GNU.
muru over 9 years

@Karthik replace the awk expression with this: '{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'
Karthik over 9 years

sorry for the trouble @muru but i get ./test1.sh: line 14: {a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}: command not found as the result :(
muru over 9 years

@Karthik that's an awk expression. Use it like this: sed ... | awk '{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'.
Karthik over 9 years

it gives the result @muru but i am actually expecting an output like this link. Here the 'count' column gives the total number of entries of the type. I want the 'unique numeric' column to show all the unique numeric values for the particular type (fap, unit, psf etc...)
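
For the per-type output asked about in the last comments, one possible shape is sketched below using only POSIX sed and awk features (no GNU awk arrays of arrays). It is an illustration rather than anything posted in the thread: the function name per_type_summary, the column layout, and the assumption that every line looks like lowercase letters, then digits, then - are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each type (leading letters), print
#   type  total-entries  unique-count  comma-separated-unique-numbers
# Assumes lines look like <lowercase letters><digits>-<anything>.
per_type_summary() {
    # sed -n ... p keeps only lines that match and rewrites them as "type number".
    sed -n 's/^\([a-z]*\)\([0-9]*\)-.*/\1 \2/p' "$1" |
    awk '
    {
        total[$1]++                       # every entry of this type
        if (!(($1, $2) in seen)) {        # first time this type/number pair shows up
            seen[$1, $2] = 1
            uniq[$1]++
            list[$1] = (list[$1] == "" ? $2 : list[$1] "," $2)
        }
    }
    END {
        for (t in total) print t, total[t], uniq[t], list[t]
    }' | sort                             # awk iteration order is unspecified
}

data=$(mktemp)
printf '%s\n' unit7433-nobody unit7333-opera unit7333-retailbanking \
    unit7133-telco unit6333-sales psf7433-nlhrms psf7331-pdesmond > "$data"
per_type_summary "$data"
# psf 2 2 7433,7331
# unit 5 4 7433,7333,7133,6333
rm -f "$data"
```

The unique numbers are listed in order of first appearance; the final sort only orders the rows by type.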