What's the character encoding used?

Solution 1

Easy search on gnome charmap:

U+0E01 THAI CHARACTER KO KAI

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Other

Various Useful Representations

UTF-8: 0xE0 0xB8 0x81
UTF-16: 0x0E01

C octal escaped UTF-8: \340\270\201
XML decimal entity: &#3585;

followed by (one or more of / a variation of):

U+0E47 THAI CHARACTER MAITAIKHU

General Character Properties

In Unicode since: 1.1
Unicode category: Mark, Non-Spacing

Various Useful Representations

UTF-8: 0xE0 0xB9 0x87
UTF-16: 0x0E47

C octal escaped UTF-8: \340\271\207
XML decimal entity: &#3655;

Annotations and Cross References

Alias names:
 • mai taikhu

The second is a non-spacing mark decorating the first char
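
If you don't have charmap at hand, the same properties can be read with Python's standard `unicodedata` module. This is only a minimal sketch: the `describe` helper is a made-up name for illustration, not part of any library.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a few charmap-style properties for a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # e.g. "Lo" = Letter, Other
        "utf8": " ".join(f"0x{b:02X}" for b in ch.encode("utf-8")),
        "xml_decimal": f"&#{ord(ch)};",
    }


# KO KAI followed by MAITAIKHU, the two characters discussed above
for ch in "\u0e01\u0e47":
    print(describe(ch))
```

The category codes `Lo` and `Mn` correspond to charmap's "Letter, Other" and "Mark, Non-Spacing".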

Solution 2

Entering those characters in the search box on Graphemica will take you to this page, showing the different characters being used:

Author: Marcus Hansson

Updated on June 17, 2022

Comments

  • Marcus Hansson
    Marcus Hansson about 2 years

    Odd character codes:

    ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้

    Question: What's the encoding of these characters?


    (Tip: Try editing this question and you'll see why they're odd, LIVE)

    Yeah, that's right. You see the same thing I do.
    Apparently, this came from a Mac. So, with the little knowledge of the subject I have, I fired up Notepad++ and tried to view it in hex.

    The result? Try it yourself: http://notepad-plus-plus.org/

    Fairly obvious; What the hell? I can understand if it is Just a Bunch of Bits in some weird proprietary binary encoding (containing stuff like color, font, etc. etc.). But why do they show up so strange?


    Also, why does Notepad++ not show the original characters from the beginning? If you turn on the hex editor and then turn it off, it's like it expands.


    (Also (again), try copy-pasting the above characters twice into Notepad++. See the difference? Nothing but 0x3f and the occasional 0x20. This is also true for each individual character. As far as I know, neither a space nor a question mark looks like the above characters. But oh, I may be wrong of course..)

    Here's a snippet from outlook:

    Do you see that?!?!

    EDIT: Editing these characters using UTF-8 instead of stupid ANSI actually lets you see the correct bytes.
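
The 0x3f / 0x20 result can probably be reproduced outside the editor. The sketch below assumes that "ANSI" here means the Windows-1252 code page (on a Thai-locale machine it could be a different code page that does contain these characters), and the sample string is just two of the combinations from the question.

```python
# Two base+mark pairs separated by a space: KO KAI + MAITAIKHU, KO KAI + SARA I
text = "\u0e01\u0e47 \u0e01\u0e34"

# Assumption: "ANSI" = Windows-1252. It has no Thai characters, so every
# unencodable code point is replaced by "?" (0x3f); the space stays 0x20.
ansi_bytes = text.encode("cp1252", errors="replace")

# UTF-8 can represent every code point; each Thai character takes 3 bytes.
utf8_bytes = text.encode("utf-8")

print(" ".join(f"{b:02x}" for b in ansi_bytes))
print(" ".join(f"{b:02x}" for b in utf8_bytes))
```

Decoding `utf8_bytes` gives back the original text, while the information in `ansi_bytes` is lost for good.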

    EDIT 2: I probably should have been more clear in what I wanted to know when I wrote the question (in my defence, I was so grossed out I just wanted to scream BRAINOVERFLOW when I saw it [the screenshot]).

    EDIT 3: (copied from yahoo answer) It appears to be a thing called "stacking diacritics" using Thai characters.

    Essentially the Thai character ก "ko kai" can have any of several superscripted diacritic marks such as ็ "maitaikhu". If you follow "ko kai" with "maitaikhu", the latter appears as a superscript thus: ก็

    If you put further diacritics after such a combination, they'll stack thus: ก็็็็็
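
To see that a stacked combination is just one base letter plus repeated marks, something like the following should work. The `clusters` function is a hypothetical helper and only a rough approximation: it attaches every mark (Unicode category starting with "M") to the character before it, which is not the full grapheme cluster algorithm.

```python
import unicodedata

BASE = "\u0e01"  # THAI CHARACTER KO KAI
MARK = "\u0e47"  # THAI CHARACTER MAITAIKHU (non-spacing mark)

# One base letter followed by five copies of the same mark
stacked = BASE + MARK * 5


def clusters(text: str) -> list:
    """Group each base character with the marks that follow it (simplified)."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the preceding character
        else:
            out.append(ch)
    return out


print(len(stacked))                           # number of code points
print([len(c) for c in clusters(stacked)])    # code points per visible unit
```

So the string holds six code points, but a renderer treats them as a single unit and decides for itself whether to stack or overlay the marks.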

    Here are the characters that will do it: http://graphemica.com/search?q=%E0%B8%81

    • wintersolutions
      wintersolutions over 12 years
      This is great, maybe this belongs to meta though?
    • Jiahua Wang
      Jiahua Wang over 12 years
      I don't think this is specific to SO.
    • adelphus
      adelphus over 12 years
      That is truly awesome! But also a bit rubbish.
    • Marcus Hansson
      Marcus Hansson over 12 years
      PizzaPill: No, actually not. It isn't specific to SO; The question is about that string of characters at the top of the question.
    • Dennis
      Dennis over 12 years
    • AakashM
      AakashM over 12 years
      btw, if you really typed in all that HTML by hand you might want to take a look at the editing help - Markdown is much easier.
    • Marcus Hansson
      Marcus Hansson over 12 years
      AakashM: Well, it isn't that much. It's just a couple of <br>.
    • wintersolutions
      wintersolutions over 12 years
      @MarcusHansson I just mentally categorized this as a question and as a bug in Stack Overflow. It looks crazy but it's 100% legal. Unicode ftw.
  • tripleee
    tripleee over 12 years
    The encoding is apparently UTF-8, if that's what the OP really wants to know. Maybe Windows has a problem with Thai combining diacritics? On my iPhone, they are overlaid, not stacked like in the screenshot.
  • guido
    guido over 12 years
    Chrome on Linux is hit by the same problem (stacking diacritics) on the third, sixth and ninth combinations, while it overlays the marks in the other combinations. I think it happens because there are many adjacent diacritics following a character, while probably only one is expected.