Some UTF-8 characters not being recognized by grep or sed
Solution 1
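The problem is that sort and uniq use the collation information of the current locale, and in this locale some distinct characters (for example ɪ and ʃ) compare as equal, so uniq merges them. Switching to the C locale for those two commands works: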
awk '{print $2}' sample | grep -o . | LC_ALL=C sort | LC_ALL=C uniq -c | sort -n
1 ʊ
1 ʌ
1 a
1 æ
1 i
1 v
2 ʃ
2 d
2 t
3 e
3 l
3 ɔ
3 r
4 ɪ
4 n
9 ˈ
9 b
11 ə
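As a smaller illustration of the same idea, here is a minimal sketch that wraps the byte-wise counting in a function. The name count_lines_bytewise is hypothetical, and the sample input is written as octal UTF-8 bytes (ʃ is \312\203, ɪ is \311\252) so that it does not depend on the encoding of the script file.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): count identical
# input lines, comparing raw bytes instead of locale collation keys.
count_lines_bytewise() {
    LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -n
}

# Two lines of U+0283 and three lines of U+026A, one character per line.
printf '\312\203\n\311\252\n\312\203\n\311\252\n\311\252\n' | count_lines_bytewise
```

With the C locale forced on both sort and uniq, the two characters are counted separately (2 and 3) rather than being folded into one line.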
Solution 2
By using the 'C' locale, however, you lose human-friendly collation (such as treating 'a' and 'A' as equivalent).
If you need both proper collation and handling of characters that the glibc locale data does not cover, you can create your own locale by extending the default collation.
You can copy the definition of your current locale (e.g. /usr/share/i18n/locales/en_US) to another name. Then edit it so that the LC_COLLATE section contains:
LC_COLLATE
copy "iso14651_t1"
reorder-after <e>
<U0259> <e>;<PCL>;<MIN>;IGNORE
reorder-after <s>
<U0283> <s>;<PCL>;<MIN>;IGNORE
reorder-end
END LC_COLLATE
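Compile it with: localedef -i ./yourmodifiedfile -f UTF-8 ./someplace
(-i names the locale source file and -f the charmap). Then you can use LC_ALL=./someplace instead of LC_ALL=C. Newer glibc versions may only accept an absolute path here, such as LC_ALL="$PWD/someplace".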
If you want to use it regularly, put the created directory alongside the other standard locales (usually /usr/share/locale or /usr/lib/locale) and name it in a standard way; for example, if it is based on en_US you could name it "en_US@IPA". Then you can set up your environment with LC_COLLATE=en_US@IPA permanently. Note that LC_ALL must not be set if you want to define individual LC_* variables, because LC_ALL overrides all of them.
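The compile step can be wrapped in a small function. This is only a sketch under assumptions: build_locale, ./en_US_IPA and the output directory are hypothetical names, and it relies on localedef taking the locale source with -i and the charmap with -f. localedef treats an output name containing a slash as a directory to write to, which is why the helper insists on one.

```shell
#!/bin/sh
# Hypothetical helper: compile an edited locale source into a directory.
#   $1 = edited copy of a locale source (e.g. a copy of en_US)
#   $2 = output directory; must contain a slash so that localedef
#        writes there instead of into the system locale path
build_locale() {
    bl_src=$1
    bl_out=$2
    case $bl_out in
        */*) ;;
        *) echo "build_locale: output path must contain a slash" >&2
           return 2 ;;
    esac
    localedef -i "$bl_src" -f UTF-8 "$bl_out"
}

# Example session (not run here; paths are assumptions):
#   cp /usr/share/i18n/locales/en_US ./en_US_IPA
#   ...edit the LC_COLLATE section of ./en_US_IPA...
#   build_locale ./en_US_IPA "$PWD/someplace"
#   LC_ALL="$PWD/someplace" sort sample
```

Using an absolute output path keeps the later LC_ALL setting valid no matter which directory the commands are run from.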
Note also that U+02C8 is a modifier, and so should rightfully be ignored in collation. But if you need to handle it as a separate character, you can use the following instead (the ASCII single quote is made equivalent to U+02C8 for collation purposes, as that is how it is often typed):
# defines a handy symbol, to group together similar chars
collating-symbol <'>
# define its position: place it after <z>
reorder-after <z>
<'>
reorder-after <e>
<U0259> <e>;<PCL>;<MIN>;IGNORE
reorder-after <s>
<U0283> <s>;<PCL>;<MIN>;IGNORE
reorder-after <'>
<U0027> <'>;<BAS>;IGNORE;IGNORE
<U02C8> <'>;<PCL>;IGNORE;IGNORE
reorder-end
Each line has the form: <unicode value> <1st level>;<2nd level>;<3rd level>;<4th level>
The levels are what is used to sort the characters.
I think (but have not tried; I leave that as an exercise :) ) that if you define only the last level, the character will be mostly ignored for sorting but still count as "different" from uniq's point of view (as long as the combination of all levels is unique, the character is unique, I think).
Usually the 1st level is a grouping symbol, for example one shared by all the e-like letters. The 2nd level is usually <BAS> for the base character; there are several other symbols for the various accented versions, and <PCL> (peculiar?) is used for "special" ones. The 3rd level is usually used to differentiate uppercase from lowercase and the like.
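Whether two characters count as "different" for uniq under a given locale can be checked directly rather than guessed. A minimal sketch, assuming a hypothetical helper named uniq_merges that takes a locale name followed by two strings:

```shell
#!/bin/sh
# Hypothetical helper: succeed if uniq, run under locale $1, merges the
# two lines $2 and $3 into one.
uniq_merges() {
    um_n=$(printf '%s\n%s\n' "$2" "$3" | LC_ALL=$1 uniq | wc -l | tr -d ' ')
    [ "$um_n" -eq 1 ]
}

esh=$(printf '\312\203')       # U+0283, as UTF-8 bytes
small_i=$(printf '\311\252')   # U+026A, as UTF-8 bytes

# Under the C locale the comparison is byte-wise, so these stay distinct.
if uniq_merges C "$esh" "$small_i"; then
    echo "merged"
else
    echo "distinct"
fi
```

Passing the name or path of a custom locale instead of C shows whether its collation rules still make uniq fold the two characters together.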
Saif Bechan
Updated on September 18, 2022
Comments
-
Saif Bechan over 1 year
I am using a html minifier, which can be found here: HTML minify
The strange thing to me is that every tag is placed on a new line. Is this common behavior, or am I doing something wrong? The output looks something like this:
Does anyone know how I can fix this so that it just creates one line of code, or does this way of minifying have some advantages?
-
Saif Bechan over 12 years
I did check the code, but regex is just a black box for me; after all these years I still don't understand it. Nice point about the browsers and long lines, I had no idea. I think I'll just keep it like this.
-
Saif Bechan over 12 years
Oh, I see it was even commented in; I did not see that. Sorry!
-
Richard about 12 years
"Long lines can be a bad bad thing - browsers might fill buffers or just drop stuff at the end of the line." - Do you have any references for that, or is it just a thought? Seems like nonsense to me.
-
choroba over 10 years
@StephaneChazelas: I am not sure, but I guess the two characters are "equivalent" under the locale (they correspond to nothing).
-
derobert over 10 years
... or there is just a bug in the locale. That seems pretty weird; I doubt it was intended.
-
Stéphane Chazelas over 10 years
ltrace on uniq does indeed show strcoll("\312\203", "\311\252") = 0
-
Stéphane Chazelas over 10 years
Actually, besides the point of ɪ and ʃ having the same sorting rank, I don't see why uniq should use strcoll. It's meant to find unique lines, not lines that have the same sorting order. IMO, memcmp should be all it needs.
-
Stéphane Chazelas over 10 years
OK, I suppose sort can't guarantee that identical (as in byte-for-byte) lines are adjacent, so uniq has to use strcoll there. So your answer is perfectly to the point. And sadly, we have to set LC_ALL to C for both uniq and sort whenever we need to use uniq.
-
choroba over 10 years
@StephaneChazelas: Nice analysis. I agree that the result is unfortunate, but inevitable.
-
Stéphane Chazelas over 10 years
@derobert, yes, U0234 to U07FF all sort the same (glibc 2.17). That can't be right.