Finding out what characters a given font supports

Solution 1

Here is a method using the fontTools Python library (which you can install with something like pip install fonttools):

#!/usr/bin/env python
from itertools import chain
import sys

from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode

with TTFont(
    sys.argv[1], 0, allowVID=0, ignoreDecompileErrors=True, fontNumber=-1
) as ttf:
    chars = chain.from_iterable(
        [y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables
    )
    if len(sys.argv) == 2:  # print all code points
        for c in chars:
            print(c)
    elif len(sys.argv) >= 3:  # search code points / characters
        code_points = {c[0] for c in chars}
        for i in sys.argv[2:]:
            code_point = int(i)   # search code point
            #code_point = ord(i)  # search character
            print(Unicode[code_point])
            print(code_point in code_points)

The script takes as arguments the font path and optionally code points / characters to search for:

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf
(32, 'space', 'SPACE')
(33, 'exclam', 'EXCLAMATION MARK')
(34, 'quotedbl', 'QUOTATION MARK')
…

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf 65 12622  # A ㅎ
LATIN CAPITAL LETTER A
True
HANGUL LETTER HIEUH
False
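The script above reads each search argument with int(), and the commented-out ord() line is the alternative for literal characters. A minimal sketch of a helper that accepts either form is shown here. The name parse_code_point and the "U+" prefix handling are illustrative assumptions and are not part of the original script.

```python
def parse_code_point(arg: str) -> int:
    """Interpret a command-line argument as a Unicode code point.

    Hypothetical helper: accepts "U+314E", a single literal character,
    or a number such as "65" or "0x41".
    """
    # "U+0041" style notation: strip the prefix and read the rest as hex.
    if arg[:2].upper() == "U+":
        return int(arg[2:], 16)
    # A single character is taken literally, so a lone digit such as "5"
    # is read as the character DIGIT FIVE, not as code point 5.
    if len(arg) == 1:
        return ord(arg)
    # Otherwise expect a number; base 0 accepts decimal and 0x-prefixed hex.
    return int(arg, 0)


if __name__ == "__main__":
    for example in ["65", "0x41", "U+314E", "\u314e"]:
        print(example, "->", parse_code_point(example))
```

In the script, code_point = parse_code_point(i) could then stand in for the int(i) line. Whether a single digit should mean the character or the number is a design choice; this sketch picks the character.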

Solution 2

The X program xfd can do this. To see all characters for the "DejaVu Sans Mono" font, run:

xfd -fa "DejaVu Sans Mono"

It's included in the x11-utils package on Debian/Ubuntu, xorg-x11-apps on Fedora/RHEL, and xorg-xfd on Arch Linux.

Solution 3

The fontconfig commands can print the supported code points as a compact list of hexadecimal ranges, e.g.:

$ fc-match --format='%{charset}\n' OpenSans
20-7e a0-17f 192 1a0-1a1 1af-1b0 1f0 1fa-1ff 218-21b 237 2bc 2c6-2c7 2c9
2d8-2dd 2f3 300-301 303 309 30f 323 384-38a 38c 38e-3a1 3a3-3ce 3d1-3d2 3d6
400-486 488-513 1e00-1e01 1e3e-1e3f 1e80-1e85 1ea0-1ef9 1f4d 2000-200b
2013-2015 2017-201e 2020-2022 2026 2030 2032-2033 2039-203a 203c 2044 2070
2074-2079 207f 20a3-20a4 20a7 20ab-20ac 2105 2113 2116 2120 2122 2126 212e
215b-215e 2202 2206 220f 2211-2212 221a 221e 222b 2248 2260 2264-2265 25ca
fb00-fb04 feff fffc-fffd

Use fc-query for a .ttf file and fc-match for an installed font name.

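This likely requires no extra packages, since fontconfig is already present on most Linux desktops, and it gives you text directly instead of a rendered bitmap that you would have to read the characters from.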

Use fc-match --format='%{file}\n' to check whether the right font is being matched.
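The charset output is a whitespace-separated list of hexadecimal code points and inclusive ranges, so it is easy to expand in a script. The following is a minimal sketch that assumes the string has already been captured from fc-match or fc-query; the function name parse_charset is made up for illustration.

```python
def parse_charset(charset: str) -> list:
    """Expand a fontconfig charset string such as "20-7e a0-17f 192"
    into a flat list of integer code points."""
    code_points = []
    for token in charset.split():
        # A token is either a single hex value ("192")
        # or an inclusive hex range ("20-7e").
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.extend(range(first, last + 1))
    return code_points


if __name__ == "__main__":
    sample = "20-7e a0-17f 192"  # first few entries of the output above
    cps = parse_charset(sample)
    print(len(cps), "code points")
    print("U+0041 supported:", 0x41 in cps)
```

The length of the returned list is the number of code points in the charset, and a membership test answers whether a specific character is covered.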

Solution 4

fc-query my-font.ttf prints a map of the supported glyphs and all the locales the font is appropriate for, according to fontconfig.

Since pretty much all modern Linux apps are fontconfig-based, this is much more useful than a raw Unicode list.

The actual output format is discussed here: http://lists.freedesktop.org/archives/fontconfig/2013-September/004915.html

Solution 5

Here is a Bash script[1] that prints each code point next to its character in a readable grid, using the fc-match command mentioned in Neil Mayhew's answer (it can handle code points of up to 8 hex digits):

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        n_hex=$(printf "%04x" "$n")
        # using \U for 5-hex-digits
        printf "%-5s\U$n_hex\t" "$n_hex"
        count=$((count + 1))
        if [ $((count % 10)) = 0 ]; then
            printf "\n"
        fi
    done
done
printf "\n"

You can pass the font name or anything that fc-match accepts:

$ ls-chars "DejaVu Sans"

Updated content:

I learned that spawning a subshell is very time-consuming (the printf command substitution in my script), so I wrote an improved version that is 5-10 times faster!

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"

Old version:

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m2.876s
user    0m2.203s
sys     0m0.888s

New version (the line count of 592, at 10 characters per line, indicates 5910+ characters, in 0.4 seconds!):

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m0.399s
user    0m0.446s
sys     0m0.120s

End of update

Sample output (it aligns better in my st terminal 😆):

0020    0021 !  0022 "  0023 #  0024 $  0025 %  0026 &  0027 '  0028 (  0029 )
002a *  002b +  002c ,  002d -  002e .  002f /  0030 0  0031 1  0032 2  0033 3
0034 4  0035 5  0036 6  0037 7  0038 8  0039 9  003a :  003b ;  003c <  003d =
003e >  003f ?  0040 @  0041 A  0042 B  0043 C  0044 D  0045 E  0046 F  0047 G
...
1f61a😚 1f61b😛 1f61c😜 1f61d😝 1f61e😞 1f61f😟 1f620😠 1f621😡 1f622😢 1f623😣
1f625😥 1f626😦 1f627😧 1f628😨 1f629😩 1f62a😪 1f62b😫 1f62d😭 1f62e😮 1f62f😯
1f630😰 1f631😱 1f632😲 1f633😳 1f634😴 1f635😵 1f636😶 1f637😷 1f638😸 1f639😹
1f63a😺 1f63b😻 1f63c😼 1f63d😽 1f63e😾 1f63f😿 1f640🙀 1f643🙃

[1] The \U escape in printf is not part of the POSIX standard; it works in the Bash built-in printf, which is why the script uses #!/bin/bash.
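Because \U handling differs between printf implementations, the same grid can also be produced without the shell. The following is a rough Python sketch, not the author's script: format_grid is a hypothetical name, and it expects a list of integer code points that has already been expanded from the charset ranges.

```python
def format_grid(code_points, per_row=10):
    """Lay out code points as "hex + character" cells, tab-separated,
    with per_row cells on each line (mirroring the shell script's layout)."""
    rows = []
    for i in range(0, len(code_points), per_row):
        cells = [
            # Hex value zero-padded to 4 digits, left-aligned in 5 columns,
            # followed directly by the character itself.
            f"{format(cp, '04x'):<5}{chr(cp)}"
            for cp in code_points[i:i + per_row]
        ]
        rows.append("\t".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Printable ASCII from U+0021 to U+002A as a small demonstration.
    print(format_grid(list(range(0x21, 0x2B)), per_row=5))
```

chr() takes the integer directly, so no escape-sequence padding is involved. Whether the characters render correctly still depends on the terminal and the fonts it uses.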

Author by

Till Ulen

Updated on July 05, 2022

Comments

  • Till Ulen
    Till Ulen almost 2 years

    How do I extract the list of supported Unicode characters from a TrueType or embedded OpenType font on Linux?

    Is there a tool or a library I can use to process a .ttf or a .eot file and build a list of code points (like U+0123, U+1234, etc.) provided by the font?

  • euxneks
    euxneks about 9 years
    xfd also gives the hex values as you need to type them in for unicode ala ctrl+shift+u
  • rspeer
    rspeer almost 9 years
    Opening up a GUI character map is not at all the same thing as listing the supported characters.
  • Skippy le Grand Gourou
    Skippy le Grand Gourou over 7 years
    int(sys.argv[2], 0) will probably fail with "invalid literal" in most cases, since one probably wants to find special characters. Use ord(sys.argv[2].decode('string_escape').decode('utf-8')) instead.
  • Skippy le Grand Gourou
    Skippy le Grand Gourou over 7 years
    Anyway, this script based on python-fontconfig seems much faster : unix.stackexchange.com/a/268286/26952
  • Martin Tournoij
    Martin Tournoij over 7 years
    @SkippyleGrandGourou That sentence seems right? It passes sys.argv[1] to TTFont()?
  • mivk
    mivk almost 6 years
    Yes, it should be possible. But it's a complex suite of modules, with miserable documentation. So without an example of how it could be done, this answer seems quite useless.
  • mirabilos
    mirabilos about 5 years
    This lies: it says “Gentium Italic” has, among others, “2150-2185”, but 2161 is definitely not in it.
  • Neil Mayhew
    Neil Mayhew about 5 years
    @mirabilos I have Gentium 5.000 and it definitely does contain 2161: ttx -t cmap -o - /usr/share/fonts/truetype/GentiumPlus-I.ttf | grep 0x2161 returns <map code="0x2161" name="uni2161"/><!-- ROMAN NUMERAL TWO -->. It's possible FontConfig is matching to a different font. Before I installed gentium, fc-match 'Gentium Italic' returned FreeMono.ttf: "FreeMono" "Regular". If so, the output of --format=%{charset} would not show what you expect.
  • Neil Mayhew
    Neil Mayhew about 5 years
    I added a note mentioning the need to check whether the right font is being matched
  • mirabilos
    mirabilos about 5 years
    Gentium Plus ≠ Gentium (I have all three, normal, Basic and Plus installed, but I was wondering about Gentium) – ah nvm, I see the problem: $ fc-match --format='%{file}\n' Gentium /usr/share/fonts/truetype/gentium/Gentium-R.ttf $ fc-match --format='%{file}\n' Gentium\ Italic /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf $ fc-match --format='%{file}\n' Gentium:Italic /usr/share/fonts/truetype/gentium/Gentium-I.ttf And fc-match --format='%{file} ⇒ %{charset}\n' Gentium:Italic DTRT, wonderful.
  • Neil Mayhew
    Neil Mayhew about 5 years
    Glad it worked out for you. Good tip about Gentium:Italic instead of Gentium Italic, too. Thanks for that.
  • mivk
    mivk about 5 years
    Note that ttx is part of the fonttools mentioned in the accepted answer. It's a Python script, so it's also available on Mac and Linux.
  • Ismael EL ATIFI
    Ismael EL ATIFI almost 5 years
    You can simplify : chars = chain.from_iterable([y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables) by chars = list(y + (Unicode[y[0]],) for x in ttf["cmap"].tables for y in x.cmap.items())
  • Mayer Goldberg
    Mayer Goldberg over 4 years
    How would you modify this script to work with otf fonts too?
  • domsson
    domsson almost 4 years
    I wonder if a similar thing is possible for the built-in bitmap fonts, like 6x13?
  • vatosarmat
    vatosarmat over 3 years
    #!/bin/sh => #!/bin/bash
  • Lu Xu
    Lu Xu over 3 years
    @vatosarmat, right, it should be something like bash, thanks. I guess the former works for me because the shell uses the executable printf instead of the shell built-in.
  • Lu Xu
    Lu Xu over 3 years
    Correction to last comment: #!/bin/sh shebang does not work for me either, maybe I really haven't tried it. My bad.
  • Cameron Kerr
    Cameron Kerr about 3 years
    \U may require 6 characters; \u for 4 characters. This is fairly typical for programming languages (otherwise it's ambiguous), although some things may be a bit lax. Makes a difference on Ubuntu 20.04 at least, where printf \U1f643 prints \u0001F643 (surrogate pair?), but \U01f643 returns 🙃
  • Lu Xu
    Lu Xu about 3 years
    @CameronKerr so adding '0's like "\U0$n_hex" in the printf line works for you on Ubuntu 20.04?
  • Lu Xu
    Lu Xu about 3 years
    Or, if \U requires at least 6 characters in your case, does that mean printf "\U0030" won't even work as desired? I really haven't tested the script on other systems than arch.
  • Cameron Kerr
    Cameron Kerr about 3 years
    Hmm, '\U0030' produces a '0', and '\U0030 ' produces '0 '. '\U0030a' produces '\u030a' (leading zeros, normalising to \u with 4 digits). However, as others have pointed out, this is bash builtin, not POSIX printf. /usr/bin/printf '\U0030' gives 'missing hexadecimal number in escape', and /usr/bin/printf '\u0030' gives 'invalid universal character name \u0030', but that's only because it should be specified as '0'. gnu-coreutils.7620.n7.nabble.com/…
  • Lu Xu
    Lu Xu about 3 years
    @CameronKerr, wow, thanks for all the research! That is more complicated than I anticipated, and the mailing list might be a bit beyond my knowledge :). From my understanding, the GNU's standalone printf program is more strict on what can follow '\u' and '\U' than the bash built-in? I can reproduce with /usr/bin/printf from coreutils package now, tho.
  • Lennart Regebro
    Lennart Regebro almost 3 years
    This only works for installed fonts, unfortunately. It would be handy to get this list before installing the font.
  • jinyong lee
    jinyong lee over 2 years
    This displayed empty rectangles for the unsupported characters.
  • rdrg109
    rdrg109 over 2 years
    You can make ttx show the output in STDOUT by using -o -. For example, ttx -o - -t cmap myfont.ttf will dump the content of the cmap table in the font myfont.ttf to STDOUT. You can then use it to see if a given character is defined in a given font (e.g. $ ttx -o - -t cmap myfont.ttf | grep '5c81')
  • rebane2001
    rebane2001 about 2 years
    I wanted to know how many characters were in a font, so here's a simplified oneliner of this answer that only counts characters: for range in $(fc-match --format='%{charset}' "$1"); do seq "0x${range%-*}" "0x${range#*-}"; done | wc -l