LESSCHARSET=utf-8 less doesn't seem to work

utf-8 unix

18,435

Solution 1

What does the locale command output? Is it a UTF-8 locale?
Are you sure your terminal is set to display UTF-8? Does echo -e '\xe2\x82\xac' produce the € (euro) sign?
Is the locale that you have set even installed on the system? Is it present in the list that locale -a outputs?
What version of less are you using? (Run less --version to find out.) Really, really old versions did not even support LESSCHARSET. This is less likely to be the case, because I have a Debian "sarge" system with less version 382, and it does not even need LESSCHARSET if the locale is set correctly.
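The checks above can be collected into a few small shell helpers. This is only a sketch: the function names (is_utf8_locale, locale_installed, euro_bytes) are made up for illustration, and it assumes the POSIX locale utility is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the active locale's codeset is UTF-8.
is_utf8_locale() {
    # "locale charmap" prints the codeset of the current LC_CTYPE, e.g. "UTF-8".
    case "$(locale charmap 2>/dev/null)" in
        UTF-8|utf-8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if the exact locale name is listed by "locale -a".
# Spelling varies between systems (e.g. en_US.UTF-8 vs. en_US.utf8),
# so pass the name the way your system prints it.
locale_installed() {
    locale -a 2>/dev/null | grep -Fqx "$1"
}

# Emit the three UTF-8 bytes of the euro sign (E2 82 AC).
# printf with octal escapes is more portable than "echo -e".
euro_bytes() {
    printf '\342\202\254'
}
```

For example, is_utf8_locale && echo "UTF-8 locale" answers the first question, locale_installed en_US.UTF-8 answers the third, and euro_bytes; echo should draw a € on a terminal that really displays UTF-8.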

Solution 2

My guess is that your file isn't UTF-8 but rather ISO-8859-1. (Is the <F4> character supposed to be an 'ô'?)

Start an xterm with LANG=en_US.ISO-8859-1 xterm, spelling the locale name exactly as locale -a lists it (on some systems it is en_US.ISO8859-1). Then verify the locale: the output of locale should show that name. Then use less to view the file. Does it display correctly?

Note that it isn't enough to just set LESSCHARSET=iso8859 without starting a new terminal. LESSCHARSET only tells less to assume that the terminal can interpret ISO-8859, but your terminal is probably displaying UTF-8, since the euro sign displays correctly. And because a lone \xf4 byte isn't a valid UTF-8 sequence, the terminal will probably show something like '�'.
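To see why a lone \xf4 byte causes trouble on a UTF-8 terminal, the byte can be run through iconv. This is a minimal sketch, assuming iconv is installed; is_valid_utf8 is a hypothetical helper name, not a standard command.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if standard input is valid UTF-8.
# Converting UTF-8 to UTF-8 makes iconv fail on any malformed sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Byte 0xF4 (octal 364) on its own is not a valid UTF-8 sequence ...
if printf '\364\n' | is_valid_utf8; then
    echo "lone 0xF4: valid UTF-8"
else
    echo "lone 0xF4: not valid UTF-8"
fi

# ... but read as ISO-8859-1 it is 'ô', which UTF-8 encodes as the
# two bytes 0xC3 0xB4. od shows the converted bytes in hex.
printf '\364' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

Running is_valid_utf8 < file on the real file is a quick way to tell whether it is UTF-8 at all before blaming less.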

Solution 3

On Mac OS, charset names are case-sensitive, so an alias such as CP1251 has to be written in upper case:

bash-4.4$ less --version
less 458 (POSIX regular expressions)
Copyright (C) 1984-2012 Mark Nudelman
bash-4.4$ LESSCHARSET=cp1251 less
invalid charset name
bash-4.4$ LESSCHARSET=CP1251 less
Missing filename ("less --help" for help)

In the less source code I found a list of charsets:

{ "ascii",          NULL,       "8bcccbcc18b95.b" },
{ "utf-8",          &utf_mode,  "8bcccbcc18b95.b126.bb" },
{ "iso8859",        NULL,       "8bcccbcc18b95.33b." },
{ "latin3",         NULL,       "8bcccbcc18b95.33b5.b8.b15.b4.b12.b18.b12.b." },
{ "arabic",         NULL,       "8bcccbcc18b95.33b.3b.7b2.13b.3b.b26.5b19.b" },
{ "greek",          NULL,       "8bcccbcc18b95.33b4.2b4.b3.b35.b44.b" },
{ "greek2005",      NULL,       "8bcccbcc18b95.33b14.b35.b44.b" },
{ "hebrew",         NULL,       "8bcccbcc18b95.33b.b29.32b28.2b2.b" },
{ "koi8-r",         NULL,       "8bcccbcc18b95.b." },
{ "KOI8-T",         NULL,       "8bcccbcc18b95.b8.b6.b8.b.b.5b7.3b4.b4.b3.b.b.3b." },
{ "georgianps",     NULL,       "8bcccbcc18b95.3b11.4b12.2b." },
{ "tcvn",           NULL,       "b..b...bcccbccbbb7.8b95.b48.5b." },
{ "TIS-620",        NULL,       "8bcccbcc18b95.b.4b.11b7.8b." },
{ "next",           NULL,       "8bcccbcc18b95.bb125.bb" },
{ "dos",            NULL,       "8bcccbcc12bc5b95.b." },
{ "windows-1251",   NULL,       "8bcccbcc12bc5b95.b24.b." },
{ "windows-1252",   NULL,       "8bcccbcc12bc5b95.b.b11.b.2b12.b." },
{ "windows-1255",   NULL,       "8bcccbcc12bc5b95.b.b8.b.5b9.b.4b." },
{ "ebcdic",         NULL,       "5bc6bcc7bcc41b.9b7.9b5.b..8b6.10b6.b9.7b9.8b8.17b3.3b9.7b9.8b8.6b10.b.b.b." },
{ "IBM-1047",       NULL,       "4cbcbc3b9cbccbccbb4c6bcc5b3cbbc4bc4bccbc191.b" },
{ NULL, NULL, NULL }

and a list of corresponding aliases for them:

{ "UTF-8",          "utf-8" },
{ "ANSI_X3.4-1968", "ascii" },
{ "US-ASCII",       "ascii" },
{ "latin1",         "iso8859" },
{ "ISO-8859-1",     "iso8859" },
{ "latin9",         "iso8859" },
{ "ISO-8859-15",    "iso8859" },
{ "latin2",         "iso8859" },
{ "ISO-8859-2",     "iso8859" },
{ "ISO-8859-3",     "latin3" },
{ "latin4",         "iso8859" },
{ "ISO-8859-4",     "iso8859" },
{ "cyrillic",       "iso8859" },
{ "ISO-8859-5",     "iso8859" },
{ "ISO-8859-6",     "arabic" },
{ "ISO-8859-7",     "greek" },
{ "IBM9005",        "greek2005" },
{ "ISO-8859-8",     "hebrew" },
{ "latin5",         "iso8859" },
{ "ISO-8859-9",     "iso8859" },
{ "latin6",         "iso8859" },
{ "ISO-8859-10",    "iso8859" },
{ "latin7",         "iso8859" },
{ "ISO-8859-13",    "iso8859" },
{ "latin8",         "iso8859" },
{ "ISO-8859-14",    "iso8859" },
{ "latin10",        "iso8859" },
{ "ISO-8859-16",    "iso8859" },
{ "IBM437",         "dos" },
{ "EBCDIC-US",      "ebcdic" },
{ "IBM1047",        "IBM-1047" },
{ "KOI8-R",         "koi8-r" },
{ "KOI8-U",         "koi8-r" },
{ "GEORGIAN-PS",    "georgianps" },
{ "TCVN5712-1",     "tcvn" },
{ "NEXTSTEP",       "next" },
{ "windows",        "windows-1252" }, /* backward compatibility */
{ "CP1251",         "windows-1251" },
{ "CP1252",         "windows-1252" },
{ "CP1255",         "windows-1255" },
{ NULL, NULL }
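Since the names are matched exactly, a value can be checked before it is exported. The sketch below hard-codes the names from the two lists above, so it only reflects the less version those lists came from; less_charset_known is a hypothetical helper, not part of less.

```shell
#!/bin/sh
# Hypothetical helper: succeed if NAME appears, with identical case,
# in the charset or alias lists quoted above.
less_charset_known() {
    case "$1" in
        # charset names
        ascii|utf-8|iso8859|latin3|arabic|greek|greek2005|hebrew) return 0 ;;
        koi8-r|KOI8-T|georgianps|tcvn|TIS-620|next|dos|ebcdic|IBM-1047) return 0 ;;
        windows-1251|windows-1252|windows-1255) return 0 ;;
        # aliases
        UTF-8|ANSI_X3.4-1968|US-ASCII|IBM9005|IBM437|EBCDIC-US|IBM1047) return 0 ;;
        latin1|latin2|latin4|latin5|latin6|latin7|latin8|latin9|latin10|cyrillic) return 0 ;;
        ISO-8859-1|ISO-8859-2|ISO-8859-3|ISO-8859-4|ISO-8859-5|ISO-8859-6|ISO-8859-7) return 0 ;;
        ISO-8859-8|ISO-8859-9|ISO-8859-10|ISO-8859-13|ISO-8859-14|ISO-8859-15|ISO-8859-16) return 0 ;;
        KOI8-R|KOI8-U|GEORGIAN-PS|TCVN5712-1|NEXTSTEP|windows|CP1251|CP1252|CP1255) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this, less_charset_known CP1251 succeeds while less_charset_known cp1251 fails, matching the "invalid charset name" error shown earlier.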

Solution 4

Try the command file file.txt. If the output is, for example, "ISO-8859 English text", convert the file from ISO-8859-1 to UTF-8 with iconv -f ISO-8859-1 -t UTF-8 file.txt > testfile.txt. (Some iconv implementations also accept -o testfile.txt, but that option is not part of POSIX, so redirection is more portable.) If less testfile.txt displays correctly, finish with mv testfile.txt file.txt.
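The convert-then-replace steps can be wrapped so that the original is only overwritten when the conversion succeeds. A sketch under assumptions: latin1_to_utf8 and the .utf8.tmp suffix are invented for illustration, and the file is assumed to really be ISO-8859-1 (check with file first).

```shell
#!/bin/sh
# Hypothetical helper: convert FILE from ISO-8859-1 to UTF-8 in place.
# The original is replaced only if iconv exits successfully.
latin1_to_utf8() {
    l2u_src=$1
    l2u_tmp="$l2u_src.utf8.tmp"   # assumed temporary name next to the original
    if iconv -f ISO-8859-1 -t UTF-8 "$l2u_src" > "$l2u_tmp"; then
        mv "$l2u_tmp" "$l2u_src"
    else
        # Conversion failed: discard the partial output, keep the original.
        rm -f "$l2u_tmp"
        return 1
    fi
}
```

Usage would be latin1_to_utf8 file.txt followed by less file.txt. Do not run it twice on the same file, because the second pass would re-encode text that is already UTF-8.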


Author by

dan

Updated on August 05, 2022

Comments

dan about 2 months

I'm trying to view a UTF-8 text file/stream in less, and even if I invoke it like this:

cat file | LESSCHARSET=utf-8 less

the non-ASCII UTF-8 characters don't display correctly. Instead, their hex values appear highlighted in angle brackets, e.g. <F4>.

Reading the same text in vim with UTF-8 encoding poses no problems, so I'm thinking something is wrong with the way I'm invoking less.

My locale output is the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

My less version is the one installed by Xcode on Mac OS X Leopard:

$ less --version | sed 's/^/    /'
less 394
Copyright (C) 1984-2005 Mark Nudelman
less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution, 
see the file named README in the less distribution.
Homepage: http://www.greenwoodsoftware.com/less

locale -a | grep US | sed 's/^/ /' outputs the following:

en_AU.US-ASCII
en_CA.US-ASCII
en_GB.US-ASCII
en_NZ.US-ASCII
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
