What is a practical way to list every character used in a file (Bash) (Regex)
Solution 1
You can use a combination of sed and sort:
$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." |
> sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'
'(),.:FJTabcdefghiklnoprstuwxy
sort does lexicographic sorting, so see man 7 ascii to see how the characters will be ordered.
Explanation:
- sed 's/./&\n/g': adds a newline after every character, since sort (usually) does line-by-line sorting
- LC_COLLATE=C: sets the collation style to C (see What does “LC_ALL=C” do?)
- sort -u: sorts the input and prints only the unique entries
- tr -d '\n': deletes all the extra newlines.
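To see what the C collation actually changes, here is a minimal sketch. The helper name sorted_with is made up for illustration and is not part of the answer above. It forces the locale with LC_ALL, because LC_ALL, when it is set in the environment, takes precedence over LC_COLLATE.

```shell
# Hypothetical helper: sort the given arguments, one per line, under the
# locale named in the first argument, then join them back into one string.
sorted_with() {
    locale_name=$1
    shift
    printf '%s\n' "$@" | LC_ALL=$locale_name sort | tr -d '\n'
    echo    # end the output with one newline
}

# In the C locale sort compares byte values, so capitals (0x41-0x5A)
# come before lower case letters (0x61-0x7A).
sorted_with C b A a B
```

With the C locale this prints ABab. Other locales usually mix upper and lower case together, but the exact order depends on which locales are installed.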
If you want to keep only visible characters:
$ echo "Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef." |
> tr -cd '[:graph:]' | sed 's/./&\n/g' | LC_COLLATE=C sort -u | tr -d '\n'
- tr -cd '[:graph:]': deletes everything except visible characters.
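The commands above read from echo, while the question is about a file. A small sketch of the same idea applied to a file follows. The function name visible_chars is an assumption made for illustration, and it splits characters with grep -o . (suggested in the comments) instead of sed, since \n in a sed replacement is a GNU sed feature.

```shell
# Hypothetical helper: list the distinct visible characters of a file.
visible_chars() {
    # tr -cd keeps only [:graph:] characters (this also drops newlines and
    # spaces); grep -o . prints each remaining character on its own line.
    tr -cd '[:graph:]' < "$1" | grep -o . | LC_ALL=C sort -u | tr -d '\n'
    echo    # end the output with one newline
}
```

Usage would be visible_chars test, where test is the file holding the sentence.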
Solution 2
You can print every character of a file on a separate line using fold -w1, then sort the output and eliminate the duplicates with sort -u (or sort | uniq):
$ cat test
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
$ fold -w1 test | sort -u
,
:
.
'
(
)
a
b
c
d
e
f
F
g
h
i
J
k
l
n
o
p
r
s
t
T
u
w
x
y
Then you can turn that into a single line again, for example with paste -sd "" -:
$ fold -w1 test | sort -u | paste -sd "" -
,:.'()abcdefFghiJklnoprstTuwxy
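The question states a preference for lower case first, then upper case, then special characters. One way to get that grouping, sketched under the assumption that only ASCII letters matter, is to filter the unique characters three times with tr. The function name grouped_chars is hypothetical.

```shell
# Hypothetical helper: print the distinct characters of a file as
# lower case, then upper case, then everything else.
grouped_chars() {
    # One character per line, duplicates removed, joined into one string.
    chars=$(fold -w 1 "$1" | LC_ALL=C sort -u | tr -d '\n')
    # Three passes over the same string, each keeping one group.
    lower=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'a-z')
    upper=$(printf '%s' "$chars" | LC_ALL=C tr -cd 'A-Z')
    other=$(printf '%s' "$chars" | LC_ALL=C tr -d 'a-zA-Z')
    printf '%s%s%s\n' "$lower" "$upper" "$other"
}
```

Within each group the characters stay in byte order, because the sort runs in the C locale. The space character, if the file contains one, ends up in the last group.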
Solution 3
Ooh, fun! Here are a few ways. The simplest (fold) has already been given, but here's a way to expand that to give the counts for each character as well:
$ fold -w 1 file | LC_ALL=C sort | uniq -c
11
2 "
1 '
1 (
1 )
3 ,
1 .
1 :
1 F
1 J
1 T
1 a
1 b
2 c
2 d
9 e
4 f
2 g
4 h
5 i
1 k
3 l
7 n
6 o
1 p
2 r
4 s
1 t
2 u
1 w
1 x
1 y
The use of LC_ALL=C sets the locale to C for the sort command, which means that CAPITALS are sorted before lower case letters, as you requested. To get it all on the same line without counting the occurrences, but with the same sort order, you could do:
$ echo $(fold -w 1 file | LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJTabcdefghiklnoprstuwxy
You could also use Perl:
$ perl -lne '$k{$_}++ for split(//); END{print sort keys(%k)}' file
"'(),.:FJTabcdefghiklnoprstuwxy
Finally, here's a way that also shows special characters like tabs, newlines and carriage returns:
$ echo $(od -c file | grep -oP "^\d+ +\K.*" | tr -s ' ' '\n' |
LC_ALL=C sort -u | tr -d '\n')
"'(),.:FJT\n\r\tabcdefghiklnoprstuwxy
------
|-------------> special characters
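The od pipeline above relies on grep -P, which POSIX does not specify. A variant sketch that avoids it is shown below. It assumes an od that accepts -An (no offset column) and -v (no collapsing of repeated lines); the function name chars_with_escapes is made up for illustration. As with the original, a space in the file does not show up as its own entry, because od -c prints it as a blank.

```shell
# Hypothetical helper: list the distinct bytes of a file, with od's escapes
# (\n, \t, \r, ...) standing in for control characters.
chars_with_escapes() {
    # -An drops the offset column, -v stops od from replacing repeated
    # output lines with "*", -c prints printable bytes as themselves and
    # control characters as backslash escapes.
    od -An -v -c "$1" | tr -s ' ' '\n' | LC_ALL=C sort -u | tr -d '\n'
    echo    # end the output with one newline
}
```

For a file containing a, a tab, b and a newline, this prints \n\tab: the backslash (0x5C) sorts before the lower case letters in the C locale.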
Solution 4
Just remove the duplicate characters from the input string. The set function in Python creates a set of items without any duplicates, i.e. set('ssss') will give you a single s.
Using python3:
$ cat file
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    for line in f:
        print("".join(sorted(set(line))))' file
'(),.:FJTabcdefghiklnoprstuwxy
If you want to remove the duplicate characters across the whole file, you could try this:
$ python3 -c 'import sys
with open(sys.argv[1]) as f:
    print("".join(sorted(set(f.read()))))' file
Comments
-
TuxForLife over 1 year
How can I turn this:
Johnny's penguin, (Tuxie), likes the following foods: French fries, and beef.
To this:
abcdefghiklnoprstuwFJT',():.
(These are the total characters used in the input)
Please note that the lower case characters "jmqvz" were not in the input sentence and are therefore not in the output.
The order is not important whatsoever, but lower case, then upper case, then special characters will be preferred.
I am certain I will need sed/awk/etc. for this, but I have not found anything similar after extensive searching.
-
TuxForLife about 9 years
Are non-visible characters ones like "\n"? If not, what do you mean by visible characters? Just tried your code, thank you, it worked like a charm.
-
muru about 9 years
@user264974 newlines, spaces, tabs, control characters - everything except those in the ASCII range 0x21 to 0x7E (see en.wikipedia.org/wiki/Regular_expression#Character_classes).
-
TuxForLife about 9 years
As a beginner, I was looking for a method using Bash, but thank you regardless, I will keep your method in case I ever explore Python. Thank you for the response. I also fixed my question by including a space, good catch!
-
Avinash Raj about 9 years
Most Linux distros have Python installed by default. Python is a better replacement for bash tools. Learn it quick.
-
axings about 9 years
Hrm, this will print the characters used per line, while the question is per file.
-
Avinash Raj about 9 years
@chx good catch, check my update.
-
steeldriver about 9 years
Instead of sed, you could also use fold -w1
-
Dennis Williamson about 9 years
grep -o . can be used in place of sed or fold
-
Boris Verkhovskiy about 3 years
This doesn't handle UTF-8, while the accepted answer does.