Some UTF-8 characters not being recognized by grep or sed

Solution 1

The problem is that sort and uniq are using collation information for the locale. Switching the locale off for the two commands works:

cat sample | awk '{print $2}' | grep -o . | LC_ALL=C sort | LC_ALL=C uniq -c | sort -n
      1 ʊ
      1 ʌ
      1 a
      1 æ
      1 i
      1 v
      2 ʃ
      2 d
      2 t
      3 e
      3 l
      3 ɔ
      3 r
      4 ɪ
      4 n
      9 ˈ
      9 b
     11 ə
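
The effect of the C locale can be checked in isolation with a minimal sketch like the following, assuming a POSIX shell. The helper name count_lines_bytewise is hypothetical, and the two characters are written as octal escapes so the example does not depend on the terminal encoding.

```shell
# Hypothetical helper: count identical lines on stdin by comparing raw bytes.
# LC_ALL=C makes sort and uniq ignore the locale's collation tables.
count_lines_bytewise() {
    LC_ALL=C sort | LC_ALL=C uniq -c
}

# \312\203 is the UTF-8 encoding of ʃ, \311\252 that of ɪ.
printf '\312\203\n\311\252\n\312\203\n' | count_lines_bytewise
```

This should report ɪ once and ʃ twice, because under the C locale two lines are equal only when their bytes are identical.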

Solution 2

By using the 'C' locale, however, you lose human-friendly collation (such as treating 'a' and 'A' as equivalent).

If you need both collation and support for characters that the glibc locale data does not handle, you can create your own locale by extending the default collation.

You can copy the definition of your current locale (e.g. /usr/share/i18n/locales/en_US) to another name. Then edit it so that the LC_COLLATE section reads:

LC_COLLATE
copy "iso14651_t1"

reorder-after <e>
<U0259> <e>;<PCL>;<MIN>;IGNORE
reorder-after <s>
<U0283> <s>;<PCL>;<MIN>;IGNORE
reorder-end

END LC_COLLATE

Compile it with localedef -i ./yourmodifiedfile -f UTF-8 ./someplace (-i names the locale source file, -f the charmap); you can then use LC_ALL=./someplace instead of LC_ALL=C.

If you want to use it regularly, put the created directory with the other standard locales (usually /usr/share/locale or /usr/lib/locale) and name it in a standard way (e.g., if it is based on en_US, you could name it "en_US@IPA"). Then you can set up your environment to have LC_COLLATE=en_US@IPA permanently. Note that you must not define LC_ALL if you want to set individual LC_* variables, because LC_ALL overrides all of them.
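
The precedence behind that last note can be sketched in a POSIX shell. The function name effective_collate is hypothetical, en_US@IPA is the example locale name from the text, and en_US.UTF-8 is an assumed value for LANG. The lookup order (LC_ALL, then LC_COLLATE, then LANG, then the C default) is the one POSIX specifies. Values are passed as arguments rather than set in the environment, so the sketch has no side effects.

```shell
# Hypothetical helper: print the locale name that governs collation.
# $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG (an empty string means "unset").
effective_collate() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"        # LC_ALL overrides everything else
    elif [ -n "${2:-}" ]; then
        printf '%s\n' "$2"        # otherwise the category variable applies
    elif [ -n "${3:-}" ]; then
        printf '%s\n' "$3"        # otherwise fall back to LANG
    else
        printf 'C\n'              # nothing set: default locale
    fi
}

effective_collate "C" "en_US@IPA" "en_US.UTF-8"   # prints C
effective_collate ""  "en_US@IPA" "en_US.UTF-8"   # prints en_US@IPA
```

With LC_ALL set, the custom LC_COLLATE value is never consulted, which is why LC_ALL has to stay undefined for the custom collation to take effect.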

Note also that U+02C8 is a modifier, and so should rightfully be ignored in collation. But if you need to handle it as a separate character, you can use the following instead (the ASCII single quote is placed in the same collation group as U+02C8, as that is how it is often typed):

# defines a handy symbol, to group together similar chars
collating-symbol <'>

# define where that symbol sorts (here: after z)
reorder-after <z>
<'>

reorder-after <e>
<U0259> <e>;<PCL>;<MIN>;IGNORE
reorder-after <s>
<U0283> <s>;<PCL>;<MIN>;IGNORE
reorder-after <'>
<U0027> <'>;<BAS>;IGNORE;IGNORE
<U02C8> <'>;<PCL>;IGNORE;IGNORE

reorder-end

Each line has the form: <unicode value> <1st level>;<2nd level>;<3rd level>;<4th level>. The levels are the weights used to sort the character.

I think (but have not tried; I leave that as an exercise :) ) that if you define only the last level, the character will be mostly ignored for sorting but still count as "different" from uniq's point of view (as long as the chain of all levels is unique, the character is unique, I think).

Usually the 1st level is a grouping symbol, such as the one for all the e-like letters. The 2nd level usually distinguishes the base character (<BAS>) from its variants: there are several other symbols for the various accented versions, and <PCL> (peculiar?) is used for "special" ones. The 3rd level is usually used to differentiate uppercase from lowercase and the like.
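
The way the levels act as successive tie-breakers can be imitated with sort keys. This is only a toy model, assuming a POSIX shell: the helper name sort_by_levels, the labels, and the numeric ranks are invented for illustration and are not the real weights from iso14651_t1. It also models single characters only, not whole strings.

```shell
# Toy model only: each input line is "label level1 level2 level3".
# Level 1 is the letter group; levels 2 and 3 use made-up ranks
# (level 2: 0 = base, 1 = peculiar; level 3: 0 = lowercase, 1 = uppercase).
sort_by_levels() {
    # Compare level 1 first, then level 2, then level 3.
    LC_ALL=C sort -t ' ' -k2,2 -k3,3n -k4,4n
}

printf '%s\n' 'f f 0 0' 'schwa e 1 0' 'E e 0 1' 'e e 0 0' | sort_by_levels
```

The lines come out in the order e, E, schwa, f: all three members of the "e" group sort before f, the case difference only matters between entries whose second level is equal, and the schwa lands after both plain e forms because its second level ranks higher.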

Author: Saif Bechan

Updated on September 18, 2022

Comments

  • choroba, over 10 years ago
    @StephaneChazelas: I am not sure, but I guess the two characters are "equivalent" under the locale (they correspond to nothing).
  • derobert, over 10 years ago
    ... or there is just a bug in the locale. That seems pretty weird; I doubt it was intended.
  • Stéphane Chazelas, over 10 years ago
    ltrace on uniq does indeed show strcoll("\312\203", "\311\252") = 0
  • Stéphane Chazelas, over 10 years ago
    Actually, besides the point of ɪ and ʃ having the same sorting rank, I don't see why uniq should use strcoll. It's meant to find unique lines, not lines that have the same sorting order. IMO, memcmp should be all it needs.
  • Stéphane Chazelas, over 10 years ago
    OK, I suppose sort can't guarantee that identical (as in byte-for-byte) lines are adjacent, so uniq has to use strcoll there. So your answer is perfectly to the point. And sadly, we have to set LC_ALL to C for both uniq and sort whenever we need to use uniq.
  • choroba, over 10 years ago
    @StephaneChazelas: Nice analysis. I agree that the result is unfortunate, but inevitable.
  • Stéphane Chazelas, over 10 years ago
    @derobert, yes, U0234 to U07FF all sort the same (glibc 2.17). That can't be right.