Can not use `cut -c` (`--characters`) with UTF-8?

text-processing character-encoding unicode cut

Solution 1

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.

POSIX requires both the -b and -c options. GNU cut did not add them because it had working multi-byte support, but so that POSIX-compliant invocations would not produce errors. Some other cut implementations treat -c the same way, although FreeBSD's and OS X's, at least, do not.

Treating characters as bytes is the historic behaviour of -c. The -b option was added later to take over the byte role so that -c could work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. Potential compatibility problems with old scripts may be a concern, although I don't know definitively what the reason is.
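
Since the behaviour differs between implementations, it can help to probe the local cut before relying on -c. The following is only a sketch: the function name `cut_c_unit` is made up for illustration, and it assumes the script is saved as UTF-8 and run in a UTF-8 locale.

```shell
# Hypothetical helper: report whether `cut -c` counts bytes or characters.
# 'α' is two bytes in UTF-8, so a byte-oriented cut emits 1 byte plus a
# newline (2 bytes in total), while a character-aware cut emits 3 bytes.
cut_c_unit() {
    n=$(printf 'αβγ\n' | cut -c 1 | wc -c)
    case $((n)) in          # $((n)) also strips any padding added by wc
        2) echo bytes ;;
        3) echo characters ;;
        *) echo unknown ;;
    esac
}
```

Calling `cut_c_unit` prints `bytes` for a cut that behaves as described above, and `characters` for one that is multi-byte aware in the current locale.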

Solution 2

colrm (part of util-linux, and already installed on most distributions) seems to handle internationalization much better:

$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α

Beware of the numbering: colrm N removes columns from N onwards, printing characters up to N-1.

(credits)
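
The off-by-one numbering is easy to get wrong, so a small wrapper can hide it. This is a sketch under assumptions: `keep_columns` is a hypothetical name, colrm is assumed to be installed, and it relies on colrm also accepting a second argument that ends the removed range.

```shell
# Hypothetical helper: keep columns FIRST..LAST of each input line,
# roughly what `cut -c FIRST-LAST` is meant to do.
keep_columns() {
    if [ "$1" -gt 1 ]; then
        # Remove everything after LAST, then remove columns 1..FIRST-1.
        colrm "$(( $2 + 1 ))" | colrm 1 "$(( $1 - 1 ))"
    else
        # Nothing to strip at the front; only remove what follows LAST.
        colrm "$(( $2 + 1 ))"
    fi
}
```

With this, `echo 'αβγ' | keep_columns 1 2` runs the same `colrm 3` as the first example above.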

Solution 3

Since many grep implementations are multibyte-aware, you can also use grep -o to simulate some uses of cut -c.

First two characters:

$ echo Τηεοδ29 | grep -o '^..'
Τη

Last three characters:

$ echo Τηεοδ29 | grep -o '...$'
δ29

Second character:

$ echo Τηεοδ29 | grep -o '^..' | grep -o '.$'
η

Adjust the number of periods, or use {x,y} syntax, to simulate cut ranges.
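
The repetition syntax can be wrapped into a range helper. The sketch below is illustrative only: `grep_chars` is a hypothetical name, it assumes a grep that supports -o and -E and whose `.` matches whole multibyte characters in the current locale, and lines shorter than LAST characters produce no output.

```shell
# Hypothetical helper: print characters FIRST..LAST of each input line.
grep_chars() {
    # Keep the first LAST characters, then the final LAST-FIRST+1 of those.
    grep -oE "^.{$2}" | grep -oE ".{$(( $2 - $1 + 1 ))}\$"
}
```

In a UTF-8 locale with a multibyte-aware grep, `echo Τηεοδ29 | grep_chars 2 4` should print `ηεο`.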

Volker Siegel

Updated on September 18, 2022

Comments

Volker Siegel over 1 year
The command cut has an option -c to work on characters, instead of bytes with the option -b. But that does not seem to work, in en_US.UTF-8 locale:

The second byte gives the second ASCII character (which is encoded just the same in UTF-8):
```
$ printf 'ABC' | cut -b 2          
B
```
but does not give the second of three Greek non-ASCII characters in a UTF-8 locale:
```
$ printf 'αβγ' | cut -b 2         
�
```
That's alright - it's the second byte.
So we look at the second character instead:
```
$ printf 'αβγ' | cut -c 2 
�
```
That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:
```
$ printf 'αβγ' | cut -c 3-4
β
```
But that's just the same as the bytes 3 to 4:
```
$ printf 'αβγ' | cut -b 3-4
β
```
So -c does nothing more than -b for UTF-8.

I'd suspect that the locale setup is not right for UTF-8, but in comparison, wc works as expected.
It is often used to count bytes, with option -c (--bytes). (Note the confusing option names.)
```
$ printf 'αβγ' | wc -c
6
```
But it can also count characters with option -m (--chars), which just works:
```
$ printf 'αβγ' | wc -m
3
```
So my configuration seems to be ok - but something is special about cut.

Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support -b and -c.

So, what's wrong? And why?

The locale setup looks right for utf8, as far as I can tell:
```
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
```
The input, byte by byte:
```
$ printf 'αβγ' | hd 
00000000  ce b1 ce b2 ce b3                                 |......|
00000006
```
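
The dump shows that each of the three letters occupies two bytes (ce b1, ce b2, ce b3), which is why bytes 3-4 happened to select β. As a sketch of that arithmetic only, and not a general solution: the hypothetical helper below assumes every character in the string is exactly two bytes long, so character K sits at bytes 2K-1 to 2K.

```shell
# Hypothetical helper: print the K-th character of a string made up
# entirely of two-byte UTF-8 characters, using byte offsets only.
nth_two_byte_char() {
    printf '%s\n' "$1" | cut -b "$(( 2 * $2 - 1 ))-$(( 2 * $2 ))"
}
```

For example, `nth_two_byte_char 'αβγ' 2` runs `cut -b 3-4` and prints `β`. Mixed input such as `aβγ` breaks the assumption and yields split characters.
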
michas over 9 years

Interesting! It looks like -c is using the same code as -b. Did you have a look at the source code? Maybe you can find a hint what -c is actually meant for.
mikeserv over 9 years

Good work. You'll find the same kind of comments in GNU's tr docs as well, and even tar, unless I misremember. I guess it's a big project.
myrdd over 5 years

See this 2017 article, subtitled “Random notes and pointers regarding the on-going effort to add multibyte and unicode support in GNU Coreutils”: crashcourse.housegordon.org/coreutils-multibyte-support.html
myrdd over 5 years

you can find some alternatives to cut -c here: superuser.com/questions/506164/…
Admin almost 2 years

colrm doesn't seem to handle emojis well: echo '😀removethis' | colrm 2 returns nothing for me.
Admin almost 2 years

@frabjous They seem to count for two characters, try echo '😀removethis' | colrm 3. ;)
Admin almost 2 years

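@frabjous Indeed, U+1F600 lies outside the Basic Multilingual Plane, although it is still valid UTF-8 (four bytes). colrm most likely counts it as two because it is a double-width character.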
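
How U+1F600 is actually encoded can be checked at the byte level. This is a minimal sketch, assuming od and tr are available and the script is saved as UTF-8; `emoji_bytes` is a name chosen here for illustration.

```shell
# Hypothetical helper: print the UTF-8 bytes of U+1F600 as one hex string.
emoji_bytes() {
    # od prints each byte as two hex digits; tr removes spaces and newlines.
    printf '%s' '😀' | od -An -tx1 | tr -d ' \n'
}
```

Running `emoji_bytes` prints `f09f9880`, i.e. a single four-byte UTF-8 sequence.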