Can not use `cut -c` (`--characters`) with UTF-8?
Solution 1
You haven't said which `cut` you're using, but since you've mentioned the GNU long option `--characters`, I'll assume it's that one. In that case, note this passage from `info coreutils 'cut invocation'`:
‘-c character-list’ ‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as `-b` for now, but internationalization will change that.
The key phrase is "the same as `-b` for now".
For the moment, GNU `cut` always works in terms of single-byte "characters", so the behaviour you see is expected.
Supporting both the `-b` and `-c` options is required by POSIX. They weren't added to GNU `cut` because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. Some other `cut` implementations treat `-c` the same way, although not FreeBSD's and OS X's at least.
This is the historic behaviour of `-c`. `-b` was newly added to take over the byte role so that `-c` can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU `cut` doesn't even implement the `-n` option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.
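Because GNU `cut -c` currently selects bytes, one stopgap is to compute the byte range yourself when every character in the input is known to occupy the same number of bytes (Greek letters such as α, β, γ take two bytes each in UTF-8). The following is only a sketch under that assumption; `nth_char_fixed` is a hypothetical helper name, not part of coreutils.

```shell
# nth_char_fixed N WIDTH: print the N-th character (1-based) of each input
# line, ASSUMING every character is exactly WIDTH bytes long.
# Hypothetical helper; it only wraps `cut -b` with a computed byte range.
nth_char_fixed() {
    n=$1
    width=$2
    start=$(( (n - 1) * width + 1 ))
    end=$(( n * width ))
    cut -b "${start}-${end}"
}

# Second character of a string made of 2-byte UTF-8 characters:
printf 'αβγ\n' | nth_char_fixed 2 2    # selects bytes 3-4, i.e. β
```

This breaks as soon as the input mixes ASCII with multi-byte characters, so it is more of a diagnostic aid than a replacement for a character-aware `cut`.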
Solution 2
`colrm` (part of `util-linux`, should be already installed on most distributions) seems to handle internationalization much better:
$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α
Beware of the numbering: `colrm N` removes columns from `N` onward, printing characters up to `N-1`.
(credits)
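To imitate a `cut -c FIRST-LAST` style range with `colrm`, two invocations can be chained: one removes everything after the range, the other removes everything before it. This is a minimal sketch; `keep_cols` is a hypothetical helper name, and it assumes `colrm` is installed. Note that `colrm` counts display columns, so a double-width character may count as two.

```shell
# keep_cols FIRST LAST: keep only columns FIRST..LAST (1-based) of each line.
# Hypothetical helper built on colrm; not a drop-in replacement for cut.
keep_cols() {
    first=$1
    last=$2
    if [ "$first" -gt 1 ]; then
        # Drop columns after LAST, then drop columns 1..FIRST-1.
        colrm "$((last + 1))" | colrm 1 "$((first - 1))"
    else
        colrm "$((last + 1))"
    fi
}

# Example with plain ASCII input (requires colrm):
# printf 'abcdef\n' | keep_cols 2 3    # prints "bc"
```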
Solution 3
Since many `grep` implementations are multibyte-aware, you can also use `grep -o` to simulate some uses of `cut -c`.
First two characters:
$ echo Τηεοδ29 | grep -o '^..'
Τη
Last three characters:
$ echo Τηεοδ29 | grep -o '...$'
δ29
Second character:
$ echo Τηεοδ29 | grep -o '^..' | grep -o '.$'
η
Adjust the number of periods, or use `{x,y}` syntax, to simulate `cut` ranges.
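The counted-repetition idea can be wrapped in a small function. This is a sketch, not a full `cut` replacement: `char_range` is a hypothetical name, it assumes a `grep` that supports `-o` and `-E`, and it only counts characters rather than bytes when `grep` is multibyte-aware in the current locale. Unlike `cut`, it prints nothing for lines shorter than LAST.

```shell
# char_range FIRST LAST: print characters FIRST..LAST (1-based) of each line.
# Hypothetical helper: the first grep keeps the leading LAST characters,
# the second keeps the trailing (LAST - FIRST + 1) of those.
char_range() {
    first=$1
    last=$2
    len=$(( last - first + 1 ))
    grep -oE "^.{$last}" | grep -oE ".{$len}\$"
}

# ASCII example, independent of the locale:
printf 'abcdef\n' | char_range 2 3    # prints "bc"
```

With a multibyte-aware `grep` in a UTF-8 locale, `echo Τηεοδ29 | char_range 2 3` should select the second and third characters in the same way.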
Volker Siegel
Updated on September 18, 2022
Comments
-
Volker Siegel over 1 year
The command `cut` has an option `-c` to work on characters, instead of bytes with the option `-b`. But that does not seem to work in the `en_US.UTF-8` locale:
The second byte gives the second ASCII character (which is encoded just the same in UTF-8):
$ printf 'ABC' | cut -b 2
B
but does not give the second of three Greek non-ASCII characters in a UTF-8 locale:
$ printf 'αβγ' | cut -b 2
�
That's alright - it's the second byte.
So we look at the second character instead:
$ printf 'αβγ' | cut -c 2
�
That looks broken.
With some experiments, it turns out that the range `3-4` shows the second character:
$ printf 'αβγ' | cut -c 3-4
β
But that's just the same as the bytes 3 to 4:
$ printf 'αβγ' | cut -b 3-4
β
So `-c` does no more than `-b` for UTF-8.
I'd suspect that the locale setup is not right for UTF-8, but in comparison, `wc` works as expected.
It is often used to count bytes, with option `-c` (`--bytes`). (Note the confusing option names.)
$ printf 'αβγ' | wc -c
6
But it can also count characters with option `-m` (`--chars`), which just works:
$ printf 'αβγ' | wc -m
3
So my configuration seems to be ok, but something is special about `cut`.
Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support `-b` and `-c`.
So, what's wrong? And why?
The locale setup looks right for UTF-8, as far as I can tell:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The input, byte by byte:
$ printf 'αβγ' | hd
00000000  ce b1 ce b2 ce b3  |......|
00000006
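`hd` is not installed everywhere; the same byte-level view can be obtained with POSIX `od`. The sketch below uses a hypothetical helper name, `hex_bytes`, and shows that each of the three Greek letters occupies two bytes, which is why the byte range `3-4` happens to select β.

```shell
# hex_bytes STRING: print the bytes of STRING as space-separated hex pairs.
# Hypothetical helper; `od -An -v -tx1` dumps raw bytes without offsets,
# and `xargs` (which runs echo by default) collapses the whitespace.
hex_bytes() {
    printf '%s' "$1" | od -An -v -tx1 | xargs
}

hex_bytes 'αβγ'    # prints: ce b1 ce b2 ce b3
```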
-
michas over 9 years
Interesting! It looks like `-c` is using the same code as `-b`. Did you have a look at the source code? Maybe you can find a hint what `-c` is actually meant for.
-
mikeserv over 9 years
good work. you'll find the same kind of comments in GNU's `tr` docs as well, and even `tar` unless i misremember. i guess it's a big project.
-
myrdd over 5 years
see this 2017 article, sub-titled “Random notes and pointers regarding the on-going effort to add multibyte and unicode support in GNU Coreutils”: crashcourse.housegordon.org/coreutils-multibyte-support.html
-
myrdd over 5 years
you can find some alternatives to `cut -c` here: superuser.com/questions/506164/…
-
Admin almost 2 years
colrm doesn't seem to handle emojis well: `echo '😀removethis' | colrm 2` returns nothing for me.
-
Admin almost 2 years
@frabjous They seem to count for two characters, try `echo '😀removethis' | colrm 3`. ;)
-
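Admin almost 2 years
@frabjous Indeed, U+1F600 lies outside the Basic Multilingual Plane. It is still valid UTF-8 (four bytes), but it is a double-width character, which is presumably why colrm counts it as two columns.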