How to find frequency of occurrences of strings contained in a file?

bash shell-script files text-processing patterns

9,648

Solution 1

Original input file

Assuming the following input format:

http://www.google.com,

www.google.com,

google.com

yahoo.com

With a result looking like this:

google.com : 3 
yahoo.com : 1

It's hard to determine the entire situation you're in but given the output you're showing us I'd be inclined to convert the input file first so that all the lines are of the form:

google.com
google.com
google.com
yahoo.com

And then run this file through the following set of commands:

$ grep -v "^$" data.txt | \
      sed -e 's/,$//' -e 's/.*\.\(.*\)\.\(.*\)$/\1.\2/' | \
      sort | uniq -c
      3 google.com
      1 yahoo.com

You can clean up the format of the output so it matches what you want like this:

$ grep -v "^$" data.txt | \
      sed -e 's/,$//' -e 's/.*\.\(.*\)\.\(.*\)$/\1.\2/' | \
      sort | uniq -c | \
      awk '{printf "%s : %s\n", $1, $2}'
      google.com : 3
      yahoo.com : 1

EDIT #1

The OP had a follow-up question where he changed the inputs in the example. So to count this type of input:

http://www.google.com/absd/siidfs/kfd837382$%^$&,

www.google.com,

google.com

yahoo.com/list/page/jhfjkshdjf...

You could use this adapted one-liner from the first example:

$ grep -v "^$" data2.txt | \
      sed -e 's/,$//' \
          -e 's#\(http://[^/]\+\).*#\1#' \
          -e '/^[^http]/ s/^www\.//' \
          -e '/^[^http]/ s#\([^/]\+\).*$#\1#' | \
          sort | uniq -c | \
          awk '{printf "%s : %s\n", $1, $2}'
2 : google.com
1 : http://www.google.com
1 : yahoo.com

Solution 2

You probably want to use sort and uniq -c to get the counts correct, then use sed or awk to do final formatting. Something like this:

sort file | uniq -c | awk '{printf "%s : %s\n", $1, $2}'

Your original question could probably be answered with the same basic pipeline, but first editing the input:

sed -e 's/http:\/\///' -e 's/^www\.//' file | sort | uniq -c |
awk '{printf "%s : %s\n", $1, $2}'

If that's not exactly correct, you can tinker with the sed and awk commands to get hostname forms and output format correct. For example, to clean off the right-hand-side of longer URLs:

sed -e 's/http:\/\///' -e 's/^www\.//' -e 's/\/..*$//' file |
sort | uniq -c |
awk '{printf "%s : %s\n", $1, $2}'

9,648

coder

Updated on September 18, 2022

Comments

coder over 1 year
I have a file that contains a list of URLs of the form

EDIT

http://www.google.com/absd/siidfs/kfd837382$%^$&,

www.google.com,

google.com

yahoo.com/list/page/jhfjkshdjf...

I want to write a script that will display the following output
```
 google.com : 2
http://www.google.com: 1
yahoo.com : 1 
```
I am stuck with the part that I have to read the URLs from the file and check the whole file again. I am new to bash scripting and hence I don't know how to do this.
coder over 10 years

what to do if I have longer URLs as asked in the edited question?
coder over 10 years

what to do if I have longer URLs as asked in the edited question?
slm over 10 years

@coder - you probably should've asked that as a different question. This one was asked and answered, you're breaking all the answers here and making a lot of clean up work for everyone. In the 2nd Q you can reference this one and say you need to expand the solution. Please ask it as a 2nd question and roll the edits back on this one so that it's as it was before. I can help if you don't know what to do.
slm over 10 years

@coder - I've provided a solution but I'd still like you to split this Q into 2 please. I'll move the portion of my answer from here to there after you've created the Q. It messes with the continuity of the site if we have overly complex Q's intermixed within the same one.
Admin over 10 years

@coder - see second script. I think that works.