View unicode codepoints for all letters in file on bash

Solution 1

I wrote myself a Perl one-liner that does just that and also prints the original character. (It expects the file on STDIN.)

perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'

However, there should be a better way than this.

Solution 2

I needed the code point for some common smileys, and came up with this:

echo -n "😊" |              # -n: do not output a trailing newline
iconv -f utf8 -t utf32be |  # UTF-32 big-endian happens to be the code point
xxd -p |                    # -p just give me the plain hex
sed -r 's/^0+/0x/' |        # remove leading 0's, replace with 0x
xargs printf 'U+%04X\n'     # pretty print the code point

which prints

U+1F60A

which is the code point for "SMILING FACE WITH SMILING EYES".
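
UTF-32 "happens to be the code point" because it stores every code point as one fixed-width 4-byte integer. Here is a minimal Python sketch of the same idea, for anyone who wants to check the pipeline's result; the function name `codepoints_via_utf32` is made up for illustration and is not part of the pipeline above.

```python
def codepoints_via_utf32(text):
    """Return 'U+XXXX' labels by reading the UTF-32BE bytes of text."""
    raw = text.encode("utf-32-be")  # 4 bytes per code point, no BOM
    labels = []
    for i in range(0, len(raw), 4):
        # Each 4-byte big-endian group is the code point value itself.
        value = int.from_bytes(raw[i:i + 4], "big")
        labels.append("U+%04X" % value)
    return labels


if __name__ == "__main__":
    print(codepoints_via_utf32("\U0001F60A"))
```

Unlike the shell pipeline, this handles more than one character, since it splits the byte string into 4-byte groups instead of treating it as a single hex number.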

Solution 3

Inspired by Neftas's answer, here is a slightly simpler solution that works with strings, rather than a single char:

iconv -f utf8 -t utf32le | hexdump -v -e '8/4 "0x%04x " "\n"' | sed -re"s/0x /   /g"
#                                         ^
# The number `8` above determines the number of columns in the output. Modify as needed.

I also made a Bash script that reads from stdin or from a file and displays the original text alongside the Unicode values:

COLWIDTH=8
SHOWTEXT=true

tmpfile=$(mktemp)
cp "${1:-/dev/stdin}" "$tmpfile"
left=$(set -o pipefail; iconv -f utf8 -t utf32le "$tmpfile" | hexdump -v -e $COLWIDTH'/4 "0x%05x " "\n"' | sed -re"s/0x /   /g")


if [ $? -gt 0 ]; then
    echo "ERROR: Could not convert input" >&2
elif $SHOWTEXT; then
    right=$(tr '[:space:]' . < "$tmpfile" | sed -re "s/.{$COLWIDTH}/|&|\n/g" | sed -re "s/^.{1,$((COLWIDTH+1))}\$/|&|/g")
    pr -mts" " <(echo "$left") <(echo "$right")
else
    echo "$left"
fi


rm "$tmpfile"
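
For comparison, here is a rough Python sketch of the same side-by-side layout: hex code points on the left, the text on the right with whitespace shown as dots. It is not a port of the script above. The function name `format_rows` and the exact padding are my own assumptions, and it works on decoded characters rather than bytes.

```python
def format_rows(text, colwidth=8):
    """Return lines pairing hex code points with the characters they encode."""
    lines = []
    for start in range(0, len(text), colwidth):
        chunk = text[start:start + colwidth]
        left = " ".join("0x%05x" % ord(ch) for ch in chunk)
        # Pad the left column so the text column lines up on a short last row
        # (each entry is 7 characters wide plus one separating space).
        left = left.ljust(colwidth * 8 - 1)
        # Show whitespace as '.', in the spirit of the tr call in the script.
        right = "".join("." if ch.isspace() else ch for ch in chunk)
        lines.append("%s |%s|" % (left, right))
    return lines


if __name__ == "__main__":
    for line in format_rows("Hi there!\n", colwidth=4):
        print(line)
```

The columns only stay aligned for code points up to five hex digits, which matches the `%05x` format used in the script.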

Solution 4

This is a solution requiring bash and using only built-ins:

while IFS= read -r -d $'\000' -n 1 x; do printf '%X\n' "'$x"; done

(-r keeps read from treating backslashes as escape characters, so they are not silently dropped.) If you want to see the characters with their mappings, you can use this:

while IFS= read -r -d $'\000' -n 1 x; do printf '%2s -> %X\n' "$x" "'$x"; done

For example:

$ echo 'Hi! 😊' | while IFS= read -r -d $'\000' -n 1 x; do printf '%2s -> %X\n' "$x" "'$x"; done
 H -> 48
 i -> 69
 ! -> 21
   -> 20
😊 -> 1F60A
 
 -> A

Note:

  • IFS= and -d $'\000' are needed to get all the characters to be mapped. Without them, newlines and word separators will come out as zeroes, which is fine if that is what you prefer.
$ echo 'Hi! 😊' | while read -r -n 1 x; do printf '%2s -> %X\n' "$x" "'$x"; done
 H -> 48
 i -> 69
 ! -> 21
   -> 0
😊 -> 1F60A
   -> 0

Solution 5

The Perl one-liner didn't work for me, and I couldn't get the hexdump methods to display the actual character alongside the code point, so here's a Python one-liner:

python -c 'import sys; print("\n".join(["\\u%04x -> %s" % (ord(c), c) for c in sys.stdin.read() if c.strip()]))'

The output is something like this:

$ cat test.txt
A á Ü Ñ  日本語 1  １  /  _
$ python -c 'import sys; print("\n".join(["\\u%04x -> %s" % (ord(c), c) for c in sys.stdin.read() if c.strip()]))' < test.txt
\u0041 -> A
\u00e1 -> á
\u00dc -> Ü
\u00d1 -> Ñ
\u65e5 -> 日
\u672c -> 本
\u8a9e -> 語
\u0031 -> 1
\uff11 -> １
\u002f -> /
\u005f -> _

Note: for python2 the text would need to be decoded:

python2 -c 'import sys; print("\n".join(["\\u%04x -> %s" % (ord(c), c) for c in sys.stdin.read().decode("utf-8") if c.strip()]))'
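
The one-liner skips whitespace (`if c.strip()`) and prints the character itself, which does not help much when the character is invisible, as in the question. Here is a hedged sketch that keeps every code point and prints its Unicode name via the standard `unicodedata` module; the function name `describe` and the `<unnamed>` fallback are my own choices for illustration.

```python
import unicodedata


def describe(text):
    """Yield one 'U+XXXX  NAME' line per code point, including invisible ones."""
    for ch in text:
        # unicodedata.name raises ValueError for characters without a name
        # (e.g. ASCII control characters) unless a default is supplied.
        name = unicodedata.name(ch, "<unnamed>")
        yield "U+%04X  %s" % (ord(ch), name)


if __name__ == "__main__":
    # Sample string with a zero width non-joiner and a right-to-left mark.
    sample = "a\u200cb\u200f"
    for line in describe(sample):
        print(line)
```

To run it on a real file, one option is to replace `sample` with the result of `open(path, encoding="utf-8").read()`, where `path` is whatever file you are inspecting.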
Author by

Windor C

Updated on September 18, 2022

Comments

  • Windor C
    Windor C almost 2 years

    I have to deal with a file that has a lot of invisible control characters, like "right to left" or "zero width non-joiner", spaces other than the normal space and so on, and I have trouble dealing with that.

    Now, I would like to somehow view all letters in a given file, letter by letter (I would like to say "left to right", but I am unfortunately dealing with a right-to-left language), as Unicode code points, using only basic bash tools (like vi, less, cat...). Is it possible somehow?

    I know I can display the file in hexadecimal with hexdump, but I would have to recompute the code points. I really want to see the actual Unicode code points, so I can google them and find out what's happening.

    edit: I will add that I don't want to transcode it to a different encoding (because that's what I am finding online). I have the file in UTF-8 and that is fine. I just want to know the exact code points of all the letters.

  • Yan King Yin
    Yan King Yin about 4 years
    Yes, it works, and we need this command
  • Windor C
    Windor C about 4 years
    Honestly, this answer is from 2012. Today I would just use xxd... and stay as far away from Perl as possible.
  • Yan King Yin
    Yan King Yin about 4 years
    xxd doesn't display unicode
  • Microsoft Linux TM
    Microsoft Linux TM over 2 years
    @KarelBílek May I ask (out of curiosity) why you would stay away from Perl?