What is a practical way to list every character used in a file (Bash) (Regex)

Solution 1

You can use a combination of sed and sort:

$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." | 
>  sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'
 '(),.:FJTabcdefghiklnoprstuwxy

sort orders its input lexicographically; with LC_COLLATE=C that means byte order, so see man 7 ascii for how the characters will be ordered.

Explanation:

  • sed 's/./&\n/g' adds a newline after every character, since sort (usually) sorts line by line. Interpreting \n in the replacement as a newline is a GNU sed extension.
  • LC_COLLATE=C sets the collation style to C (see What does “LC_ALL=C” do?).
  • sort -u sorts the input and prints only the unique entries.
  • tr -d '\n' deletes all the extra newlines.
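
The three steps above (split into lines, sort -u, join) can also be wrapped in a small function. This is only a sketch: the name unique_chars is made up for illustration, and it swaps sed for grep -o ., which assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Hypothetical helper (not part of the original answer): print each distinct
# character read from stdin once, in byte order, on a single line.
unique_chars() {
    # grep -o . puts every character on its own line; a newline never
    # matches '.', so no empty line reaches sort.
    grep -o . | LC_ALL=C sort -u | tr -d '\n'
    echo    # end the output with a single newline
}

printf '%s\n' 'hello world' | unique_chars
# prints " dehlorw" (the leading space is the space character)
```

Reading from stdin keeps it usable both in a pipe and with a redirect such as unique_chars < file.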

If you want to keep only visible characters:

$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." |
> tr -cd '[:graph:]' | sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'

  • tr -cd '[:graph:]' deletes everything except visible characters.

Solution 2

You can print every character of a file on a separate line using fold -w1, then sort the output and eliminate the duplicates with sort -u (or sort | uniq):

$ cat test 
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
$ fold -w1 test | sort -u

,
:
.
'
(
)
a
b
c
d
e
f
F
g
h
i
J
k
l
n
o
p
r
s
t
T
u
w
x
y

Then you can turn that into a single line again, for example with paste -sd "" -:

$ fold -w1 test | sort -u | paste -sd "" -
 ,:.'()abcdefFghiJklnoprstTuwxy
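
The question mentions a preference for lower case first, then upper case, then special characters. One way to get that grouping is sketched below, assuming plain ASCII text on stdin; the function name ordered_chars is hypothetical, and the C locale is forced so the result does not depend on the system's collation rules.

```shell
# Hypothetical helper: distinct characters from stdin, grouped as
# lower case, then upper case, then everything else.
ordered_chars() (
    # Distinct characters joined on one line, in byte order.
    chars=$(fold -w1 | LC_ALL=C sort -u | tr -d '\n')
    # Partition that line with tr; byte order is kept within each group.
    lower=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'a-z')
    upper=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'A-Z')
    other=$(printf '%s' "$chars" | LC_ALL=C tr -d 'a-zA-Z')
    printf '%s%s%s\n' "$lower" "$upper" "$other"
)

printf '%s\n' 'Hello, World!' | ordered_chars
# prints "delorHW !,"
```

The body runs in a subshell (parentheses instead of braces), so the helper variables do not leak into the calling shell.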

Solution 3

Ooh, fun! Here are a few ways. The simplest (fold) has already been given, but here's a way to expand that to give the counts for each character as well:

$ fold -w 1 file | LC_ALL=C sort  | uniq -c
 11  
  2 "
  1 '
  1 (
  1 )
  3 ,
  1 .
  1 :
  1 F
  1 J
  1 T
  1 a
  1 b
  2 c
  2 d
  9 e
  4 f
  2 g
  4 h
  5 i
  1 k
  3 l
  7 n
  6 o
  1 p
  2 r
  4 s
  1 t
  2 u
  1 w
  1 x
  1 y

Setting LC_ALL=C puts the sort command in the C locale, which means that capital letters are sorted before lowercase ones, as you requested. To get everything on one line without counting the occurrences, but in the same sort order, you could do:

$ echo $(fold -w 1 file | LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJTabcdefghiklnoprstuwxy

You could also use Perl:

$ perl -lne '$k{$_}++ for split(//); END{print sort keys(%k)}' file
"'(),.:FJTabcdefghiklnoprstuwxy

Finally, here's a way that also shows special characters like tabs, newlines and carriage returns:

$ echo $(od -c file | grep -oP "^\d+ +\K.*" | tr -s ' ' '\n' | 
    LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJT\n\r\tabcdefghiklnoprstuwxy
          ------
            |-------------> special characters
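
Note that od -c prints a space as a blank, so the pipeline above cannot report the space character itself (it is missing from the output). One possible workaround, sketched here, is to list the distinct bytes in hex with od -tx1 instead. The name distinct_bytes is hypothetical, and the codes can be looked up in man 7 ascii.

```shell
# Hypothetical helper: list the distinct bytes read from stdin as hex codes.
distinct_bytes() {
    # -An drops the offset column, -v disables od's '*' shorthand for
    # repeated lines, -tx1 prints one two-digit hex value per byte.
    od -An -v -tx1 |
        tr -s ' ' '\n' |      # one value per line
        grep . |              # drop the empty line left by leading blanks
        LC_ALL=C sort -u |
        paste -sd ' ' -       # join into one space-separated line
}

printf 'a b\tb\n' | distinct_bytes
# prints "09 0a 20 61 62" (tab, newline, space, a, b)
```

Because it works on bytes rather than characters, a multibyte UTF-8 character shows up as several separate codes.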

Solution 4

Just remove the duplicate characters from the input string. Python's set() function creates a collection of items without any duplicates; for example, set('ssss') gives you a single s.

Using Python 3:

$ cat file
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.

$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    for line in f:
        print("".join(sorted(set(line))))' file
 '(),.:FJTabcdefghiklnoprstuwxy

If you want to remove the duplicate characters across the whole file instead, you could try this:

$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    print("".join(sorted(set(f.read()))))' file

Author by TuxForLife

Updated on September 18, 2022

Comments

  • TuxForLife
    TuxForLife over 1 year

    How can I turn this:

    Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
    

    To this:

     abcdefghiklnoprstuwFJT',():.
    

    (These are the total characters used in the input)

    Please note that the lowercase characters "jmqvz" were not in the input sentence and are therefore not in the output.

    The order is not important whatsoever, but lower case, then upper case, then special characters will be preferred.

    I am certain I will need sed/awk/etc. for this, but I have not found anything similar after extensive searching.

  • TuxForLife
    TuxForLife about 9 years
    Are non-visible characters ones like "\n"? If not, what do you mean by visible characters? Just tried your code, thank you it worked like a charm
  • muru
    muru about 9 years
    @user264974 newlines, spaces, tabs, control characters - everything except those in the ASCII range 0x21 to 0x7E (see en.wikipedia.org/wiki/Regular_expression#Character_classes).
  • TuxForLife
    TuxForLife about 9 years
    As a beginner, I was looking for a method using Bash, but thank you regardless, I will keep your method in case I ever explore Python. Thank you for the response. I also fixed my question by including a space, good catch!
  • Avinash Raj
    Avinash Raj about 9 years
    Most Linux distros have Python installed by default. Python is a better replacement for bash tools. Learn it quick.
  • axings
    axings about 9 years
    Hrm, this will print the characters used per line while the question is per file.
  • Avinash Raj
    Avinash Raj about 9 years
    @chx good catch, check my update..
  • steeldriver
    steeldriver about 9 years
    Instead of sed, you could also use fold -w1
  • Dennis Williamson
    Dennis Williamson about 9 years
    grep -o . can be used in place of sed or fold
  • Boris Verkhovskiy
    Boris Verkhovskiy about 3 years
    This doesn't handle UTF-8, while the accepted answer does.