How to get the unique count of a particular part of a string
Solution 1
With grep, filter out just the numbers:
grep -Eo '[0-9]+-' file | sort -u | wc -l
- [0-9] matches any character between 0 and 9 (any digit).
- The + in extended regular expressions stands for at least one occurrence of the preceding item (that's why the -E option is used with grep). So [0-9]+- matches one or more digits, followed by a -.
- The -o option only prints the part that matched your pattern, so given input abcd23-gf56, grep will only print 23-.
- sort -u sorts and filters unique entries (due to -u), and wc -l counts the number of lines in input (hence, the number of unique entries).
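A minimal sketch of the stages described above, run on a few made-up lines rather than the asker's file. The function name count_unique_numbers and the sample input are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: count distinct digit runs that are immediately
# followed by a dash. Reads from standard input.
count_unique_numbers() {
    grep -Eo '[0-9]+-' | sort -u | wc -l
}

# Intermediate stage: -o prints only the matched part of each line.
# The trailing 56 in the first line is not followed by a dash, so it is skipped.
printf '%s\n' abcd23-gf56 xy23-z q7-r | grep -Eo '[0-9]+-'

# Full pipeline: the two 23- matches collapse into one entry under sort -u.
printf '%s\n' abcd23-gf56 xy23-z q7-r | count_unique_numbers
```

The first command prints 23-, 23- and 7- on separate lines; the second reports two unique entries.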
Solution 2
Use extended grep to look for one or more digits, telling grep to only list the matches (as opposed to the whole line, which is the default):
grep -Eo '[0-9]+' <filename>
Sort this list of numbers and only output unique ones:
sort -u
Count the number of lines:
wc -l
Put it all together:
$ grep -Eo '[0-9]+' filename | sort -u | wc -l
8
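One caveat raised in the comments: the pattern [0-9]+ matches every run of digits on a line, including digits after the dash. The sketch below contrasts it with the dash-anchored pattern from Solution 1, using the asker's own example sun7335-ukhrms8768. The function names are hypothetical.

```shell
# Hypothetical helpers for comparing the two patterns on standard input.
count_all_digit_runs()    { grep -Eo '[0-9]+'  | sort -u | wc -l; }
count_runs_before_dash()  { grep -Eo '[0-9]+-' | sort -u | wc -l; }

# 8768 comes after the dash, so only the first pattern picks it up.
printf '%s\n' sun7335-ukhrms8768 unit7335-telco | count_all_digit_runs
printf '%s\n' sun7335-ukhrms8768 unit7335-telco | count_runs_before_dash
```

Here the first pipeline counts two values (7335 and 8768) while the second counts one (7335-). If the real data can carry digits after the dash, the anchored pattern is probably the safer choice.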
Solution 3
You can use:
tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .
...which is, admittedly, more than a little inspired by muru's answer here. Differently, though, I use grep to count the lines rather than wc in case there are blank lines in input. His answer doesn't have a blank line problem as grep -o will only print lines with their match (as grep -c only counts them here), but tr does print blank lines because the newline is one of the few characters it does not delete. This means any number of blank lines in input would skew wc's results by one.
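A small sketch of the blank-line effect, assuming a made-up three-line input with one empty line in the middle. The helper names are hypothetical; both use the same tr and sort stages and differ only in how they count.

```shell
# Hypothetical helpers: identical filtering, two different counters.
count_with_wc()   { tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | wc -l; }
count_with_grep() { tr -dc '\-0-9\n' | sort -u -t- -nk1,1 | grep -c .; }

# Two real entries separated by one blank line.
printf '%s\n' 'a1-x' '' 'b2-y' | count_with_wc     # the blank line is counted
printf '%s\n' 'a1-x' '' 'b2-y' | count_with_grep   # only non-empty lines are counted
```

With this input the wc variant reports 3 and the grep variant reports 2, which is the off-by-one described above.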
So tr here is probably more efficient than grep -o, but wc likely beats grep in the counting department. I like it this way, I think, for portability reasons, and also because I usually try to prune data with the most efficient filter first, and to use the less efficient ones later in the chain.
This lets sort pick the bits per line that it will consider in its -u (unique) sort, based on its -n (numeric) sort -k (key), which it splits on its -t- delimiter. tr -d deletes the -c complement of any numeric, dash, or newline byte in its input. That way, so long as there are no - dashes occurring before the numeric strings you wish to compare, the only thing remaining on any line is:
#nothing at all
...or...
[numbers]
...or...
[numbers]-[more numbers]more-dashes-...
So when the output is piped to sort we instruct it only to compare numeric strings occurring before a dash, if any. In that way, dashes or not, the only numbers which matter are the ones you want to count.
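To make the two stages concrete, here is a sketch using two lines from the sample data. The helper name unique_first_fields is hypothetical.

```shell
# Stage 1: tr keeps only digits, dashes and newlines.
# unit7333-pro-09-0jm becomes 7333--09-0 and unit7333-opera becomes 7333-.
printf '%s\n' unit7333-pro-09-0jm unit7333-opera | tr -dc '\-0-9\n'

# Stage 2 (hypothetical helper): sort compares only the first
# dash-delimited field, numerically, and keeps one line per distinct value.
unique_first_fields() {
    tr -dc '\-0-9\n' | sort -u -t- -nk1,1
}

# Both lines share the key 7333, so a single line survives.
printf '%s\n' unit7333-pro-09-0jm unit7333-opera | unique_first_fields
```

The digits after the first dash never take part in the comparison, which is why 09 and 0 do not add to the count.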
So then we grep -c (count) the lines containing at least one character, since the pattern . matches any single character. The following command prints 8:
tr -dc '\-0-9\n' <<\IN | sort -u -t- -nk1,1 | grep -c .
psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-pro-09-0jm
mnp7330-redir09o-0ect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support
unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply
IN
Karthik
Updated on September 18, 2022
Comments
-
Karthik over 1 year
I have a set of data in a file.
psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-projm
mnp7330-redirect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support
unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply
How can I ignore the leading alphabetic characters in each line and the characters after the number, and get the count of the unique numbers? Or, put another way: how can I retrieve only the numeric value in each line and get their unique count?
Considering we manage to extract only the numeric values, we will get this.
7433 7433 7333 7333 7333 7330 7331 7333 7330 7333 7333 6333 7323 7133 7323 7335 7133
Now, I want the unique count of the retrieved numeric values. So ignoring the repetitions, I should get the following final output.
8
I am unable to do this using awk, sed, or even a simple grep | cut.
I do not want the list of extracted values, I want only the final count as the answer.
Help me!
-
Karthik over 9 years
Hi, I just modified the question with a new set of input. Will the code be applicable for the new input too?
-
Gilles Quenot over 9 years
POST edited accordingly
-
Karthik over 9 years
i'm new to shell scripting so can you explain what this script does in short. thanks a ton!
-
Karthik over 9 years
gives me the answer 0
-
Gilles Quenot over 9 years
You have to replace file with your own file, Karthik
-
Karthik over 9 years
works after the edit! just a doubt... will this piece of script work even if one of the file entries contains a line like sun73-ukhrms, where the numeric value is just 2 digits? Moreover, hope this script does not read any numeric value after the '-' character in the line.
-
Karthik over 9 years
as per my requirement, the script should ignore any numeric value after the '-' character in the input (if any). For eg: if the input is sun7335-ukhrms8768, it should not read 8768 as it falls after the '-' character.
-
Gilles Quenot over 9 years
New requirement satisfied chief
-
muru over 9 years
@Karthik Does the - appear only once per line? And does it always appear after the first number?
-
Karthik over 9 years
yes only once.
-
muru over 9 years
@Karthik check the update.
-
muru over 9 years
@mikeserv presence of multiple numbers might be a problem. See the second comment on this answer.
-
mikeserv over 9 years
good point. missed that. plus echo was dumb - but if the last line doesn't end in a newline your counts will be off by one.
-
muru over 9 years
@mikeserv I suppose a {grep ...; echo} should fix that?
-
Karthik over 9 years
thanks but will this script also read any numeric (just in case) which occurs after the '-' character.
-
Karthik over 9 years
I provided only 4 digit numerics in the sample data. My original data contains entries with 2,3 digit numeric values too. The only condition is it should not count any numeric values after the '-' character in the entry.
-
garethTheRed over 9 years
@Karthik - I've changed the grep search pattern to look for at least one digit.
-
Karthik over 9 years
@muru grep -Eo '[0-9]+-' file | sort -u | wc -l gives me an approximate count when applied on my original input file. Thanks. Now, I need a small favor. Let me pull out 5 entries from my input:
unit7433-nobody unit7333-opera unit7333-retailbanking unit7133-telco unit6333-sales
This type 'unit' contains 4 unique numbers in 5 entries. Another case is
psf7433-nlhrms psf7331-pdesmond
where the type 'psf' contains 2 unique numbers in its 2 entries. How to pull out the count of unique numbers from each type in the input? Should we use an 'if' clause or something else?
-
muru over 9 years
@Karthik you should consider using something like awk for that. Try:
sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}'
-
Karthik over 9 years
to run this script for my input file all-dss-accounts.txt, i suppose i should use this
sed -r 's/([a-z]*)([0-9]*)-.*/\1 \2/' | awk '{a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}' all-dss-accounts.txt
-
muru over 9 years
@Karthik other way around:
sed ... all-dss-accounts | awk ...
-
Karthik over 9 years
gives me this error
awk: {a[$1][$2]++}END{for (i in a) {print i,length(a[i])}}
awk: ^ syntax error
-
muru over 9 years
@Karthik looks like I used a GNU awk feature. Any chance you can get GNU awk?
-
Karthik over 9 years
@muru i donno how to explain it but i'm working on a unix server using putty. i'm running my own business so i use the client's unix machine to get the input. i dont think i can use GNU.
-
muru over 9 years
@Karthik replace the awk expression with this:
'{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'
-
Karthik over 9 years
sorry for the trouble @muru but i get
./test1.sh: line 14: {a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}: command not found
as the result :(
-
muru over 9 years
@Karthik that's an awk expression. Use it like this:
sed ... | awk '{a[$1,$2]++} END{for (ij in a) {split(ij,xx,SUBSEP); b[xx[1]]++;} for (i in b) {print i,b[i]}}'
-
Karthik over 9 years
it gives the result @muru but i am actually expecting an output like this link. Here the 'count' column gives the total number of entries of the type. I want the 'unique numeric' column to show all the unique numeric values for the particular type (fap, unit, psf etc...)
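
The per-type count discussed in the comments can also be sketched with awk alone, avoiding both the GNU-only nested arrays and sed -r. This is an illustration, not muru's exact command: the function name count_unique_per_type is hypothetical, and it assumes every entry looks like letters, then digits, then a dash. It covers only the count of unique numbers per type, not the listing of values asked for in the last comment.

```shell
# Hypothetical helper: print "<type> <number of unique numeric values>"
# for each alphabetic prefix. Reads the named files, or standard input.
count_unique_per_type() {
    awk '
    {
        prefix = $0
        sub(/-.*/, "", prefix)          # keep only the text before the first dash
        type = prefix
        sub(/[0-9]+$/, "", type)        # strip trailing digits: the type, e.g. unit
        num = prefix
        sub(/^[^0-9]*/, "", num)        # strip leading non-digits: the number
        if (num == "") next             # skip lines with no number before the dash
        if (!((type, num) in seen)) {   # first time this type/number pair appears
            seen[type, num] = 1
            count[type]++
        }
    }
    END { for (t in count) print t, count[t] }
    ' "$@" | sort
}

printf '%s\n' unit7433-nobody unit7333-opera unit7333-retailbanking \
    unit7133-telco unit6333-sales psf7433-nlhrms psf7331-pdesmond |
    count_unique_per_type
```

For the seven entries quoted in the comments this prints psf 2 and unit 4, matching the counts the asker describes.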