What is a practical way to list every character used in a file (Bash) (Regex)

bash sed regex awk

Solution 1

You can use a combination of sed and sort:

$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." | 
>  sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'
 '(),.:FJTabcdefghiklnoprstuwxy

sort orders the characters lexicographically; with LC_COLLATE=C that means by byte value, so see man 7 ascii for the resulting order.

Explanation:

sed 's/./&\n/g': adds a newline after every character, since sort (usually) works line by line
LC_COLLATE=C: sets the collation order to C (see What does “LC_ALL=C” do?)
sort -u: sorts the input and prints only the unique entries
tr -d '\n': deletes all the extra newlines
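
The `\n` in the sed replacement is not guaranteed by POSIX sed, so it may not produce a newline outside GNU sed. As a minimal sketch of the same pipeline, here is the `grep -o .` variant suggested in the comments, wrapped in a helper function. The name `unique_chars` is an assumption for illustration and does not come from the answer; `LC_ALL=C` is used in place of `LC_COLLATE=C` and should give the same byte-order sort.

```bash
#!/usr/bin/env bash
# unique_chars: hypothetical helper (illustrative name, not from the answer).
# Reads text on stdin and prints each distinct character once, in byte order.
unique_chars() {
    grep -o . |            # one character per line (stands in for sed 's/./&\n/g')
        LC_ALL=C sort -u | # byte-order sort with duplicates dropped
        tr -d '\n'         # join the characters back into a single line
    echo                   # finish with one trailing newline
}

# Usage: feed the sample sentence through the helper.
printf '%s\n' "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." | unique_chars
```

For a file, the same helper can be called as `unique_chars < file`.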

If you want to keep only visible characters:

$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." | 
> tr -cd '[:graph:]' | sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'

tr -cd '[:graph:]' deletes everything except visible characters.

Solution 2

You can print every character of a file on a separate line using fold -w1, then sort the output and eliminate the duplicates with sort -u (or sort | uniq):

$ cat test 
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
$ fold -w1 test | sort -u

,
:
.
'
(
)
a
b
c
d
e
f
F
g
h
i
J
k
l
n
o
p
r
s
t
T
u
w
x
y

Then you can turn that into a single line again, for example with paste -sd "" -:

$ fold -w1 test | sort -u | paste -sd "" -
 ,:.'()abcdefFghiJklnoprstTuwxy

Solution 3

Ooh, fun! Here are a few ways. The simplest (fold) has already been given, but here's a way to expand that to give the counts for each character as well:

$ fold -w 1 file | LC_ALL=C sort  | uniq -c
 11  
  2 "
  1 '
  1 (
  1 )
  3 ,
  1 .
  1 :
  1 F
  1 J
  1 T
  1 a
  1 b
  2 c
  2 d
  9 e
  4 f
  2 g
  4 h
  5 i
  1 k
  3 l
  7 n
  6 o
  1 p
  2 r
  4 s
  1 t
  2 u
  1 w
  1 x
  1 y

Setting LC_ALL=C runs sort in the C locale, which orders characters by byte value, so capitals sort before lowercase letters. To get everything on one line without the counts, but in the same sort order, you could do:

$ echo $(fold -w 1 file | LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJTabcdefghiklnoprstuwxy
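
The question states a preference for lowercase first, then uppercase, then special characters, which byte order alone does not give. A minimal sketch of one way to regroup the unique characters, assuming ASCII input (the ranges `a-z` and `A-Z` cover only unaccented letters); the function name `ordered_chars` is hypothetical:

```bash
#!/usr/bin/env bash
# ordered_chars: hypothetical helper (illustrative name).
# Reads text on stdin and prints the distinct characters grouped as
# lowercase letters, then uppercase letters, then everything else.
ordered_chars() {
    local chars lower upper other
    # Distinct characters in byte order, joined into one string.
    chars=$(fold -w 1 | LC_ALL=C sort -u | tr -d '\n')
    # Split that string into three groups with tr.
    lower=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'a-z')
    upper=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'A-Z')
    other=$(printf '%s' "$chars" | LC_ALL=C tr -d 'a-zA-Z')
    printf '%s%s%s\n' "$lower" "$upper" "$other"
}

# Usage: regroup the characters of the sample sentence.
printf '%s\n' "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." | ordered_chars
```

Quoting the variables keeps the space character, which an unquoted `echo $(...)` would drop.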

You could also use Perl:

$ perl -lne '$k{$_}++ for split(//); END{print sort keys(%k)}' file
"'(),.:FJTabcdefghiklnoprstuwxy

Finally, here's a way that also shows special characters like tabs, newlines and carriage returns:

$ echo $(od -c file | grep -oP "^\d+ +\K.*" | tr -s ' ' '\n' | 
    LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJT\n\r\tabcdefghiklnoprstuwxy
          ------
            |-------------> special characters

Solution 4

Just remove the duplicate characters from the input string. Python's set function builds a set of items without any duplicates; i.e., set('ssss') gives you a single s.

Using Python 3:

$ cat file
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.

$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    for line in f:
        print("".join(sorted(set(line))))' file
 '(),.:FJTabcdefghiklnoprstuwxy

If you want to remove the duplicate characters across the whole file, you could try this:

$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    print("".join(sorted(set(f.read()))))' file

TuxForLife

Updated on September 18, 2022

Comments

TuxForLife over 1 year
How can I turn this:
```
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
```
To this:
```
 abcdefghiklnoprstuwFJT',():.
```
(These are all the characters used in the input.)

Please note that the lowercase characters "jmqvz" were not in the input sentence and are therefore not in the output.

The order is not important whatsoever, but lower case, then upper case, then special characters will be preferred.

I am certain I will need sed/awk/etc. for this, but I have not found anything similar after extensive searching.
TuxForLife about 9 years

Are non-visible characters ones like "\n"? If not, what do you mean by visible characters? Just tried your code; thank you, it worked like a charm.
muru about 9 years

@user264974 newlines, spaces, tabs, control characters - everything except those in the ASCII range 0x21 to 0x7E (see en.wikipedia.org/wiki/Regular_expression#Character_classes).
TuxForLife about 9 years

As a beginner, I was looking for a method using Bash, but thank you regardless, I will keep your method in case I ever explore Python. Thank you for the response. I also fixed my question by including a space, good catch!
Avinash Raj about 9 years

Most Linux distros have Python installed by default. Python is a better replacement for bash tools. Learn it quick.
axings about 9 years

Hrm, this will print the characters used per line while the question is per file.
Avinash Raj about 9 years

@chx good catch, check my update..
steeldriver about 9 years

Instead of sed, you could also use fold -w1
Dennis Williamson about 9 years

grep -o . can be used in place of sed or fold
Boris Verkhovskiy about 3 years

This doesn't handle UTF-8, while the accepted answer does.