What is this character: '*​'?


Solution 1

The paste failed not because of the asterisk, which is a perfectly regular asterisk, but because of the Unicode character U+200B that follows it. That character is a ZERO WIDTH SPACE, so it is invisible on screen and nothing looks wrong when the text is copied.

Using the Python code:

stro=u"'*​'?"
def uniconv(text):
    return " ".join(hex(ord(char)) for char in text)
print(uniconv(stro))

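The function uniconv converts each character of the input string (in this case u"'*'?", with an invisible U+200B after the asterisk) into its Unicode code point in hexadecimal format. The u prefix marks the string as a Unicode string; it is required in Python 2 and accepted but redundant in Python 3.3 and later, where every string is Unicode.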

I was able to obtain the output:

0x27 0x2a 0x200b 0x27 0x3f

We can clearly see that 0x27, 0x2a and 0x3f are the ASCII/Unicode hexadecimal values for the characters ', * and ? respectively. That leaves 0x200b as the only unexpected value, which identifies the mystery character as U+200B.

Note that the Python code, when pasted into the body, had the U+200B character removed by SE's Markdown software. In order to obtain the expected result, you need to copy it directly from the title using the Edit view.
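
For Python 3, the same idea can be sketched with the standard unicodedata module, which also reports each character's name. The helper name describe_chars is hypothetical, and the zero width space is written as the escape \u200b so that it survives copying; this is a minimal sketch, not the code used for the output above.

```python
import unicodedata

def describe_chars(text):
    """Return one 'U+XXXX NAME' string per character in text."""
    lines = []
    for char in text:
        # unicodedata.name raises ValueError for unnamed characters
        # unless a default is supplied.
        name = unicodedata.name(char, "<unnamed>")
        lines.append("U+{:04X} {}".format(ord(char), name))
    return lines

# Assumption: same input as above, with U+200B spelled as an escape.
sample = "'*\u200b'?"
for line in describe_chars(sample):
    print(line)
```

The third line printed is the one of interest: it names the character hiding between the asterisk and the closing quote.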

Solution 2

With the help of @Rinzwind in the Ask Ubuntu chat room, I figured out that the problem isn't the asterisk at all. Note the output of od:

$ printf '*​' | od -c
0000000   * 342 200 213
0000004

The 342 200 213 is the octal representation of the UTF-8 bytes of a second character, and we can use this site to look it up:

Character                   ​               
Character name                              ZERO WIDTH SPACE
Hex code point                              200B
Decimal code point                          8203
Hex UTF-8 bytes                             E2 80 8B
Octal UTF-8 bytes                           342 200 213
UTF-8 bytes as Latin-1 characters bytes     â <80> <8B>

So, what I actually had was two Unicode characters: the normal * followed by a zero width space.
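
The same lookup can be done without a website. A small Python 3 sketch, assuming the byte values are taken from the od output above: it parses the octal values, decodes them as UTF-8 and names the result. The function name decode_octal_bytes is illustrative only.

```python
import unicodedata

def decode_octal_bytes(octal_values):
    """Decode whitespace-separated octal byte values as UTF-8 text."""
    raw = bytes(int(value, 8) for value in octal_values.split())
    return raw.decode("utf-8")

# The three bytes od printed after the asterisk.
char = decode_octal_bytes("342 200 213")
print("U+{:04X}".format(ord(char)), unicodedata.name(char))
```

Octal 342 200 213 is E2 80 8B in hexadecimal, which is the UTF-8 encoding of a single code point.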


terdon
Author by

terdon

Elected moderator on Unix & Linux. I've been using Linux since the late '90s and have gone through a variety of distributions. At one time or another, I've been a user of Mandrake, SuSe, openSuSe, Fedora, RedHat, Ubuntu, Mint, Linux Mint Debian Edition (basically Debian testing but more green) and, for the past few years, Arch. My Linux expertise, such as it is, is mostly on manipulating text and regular expressions since that represents a large chunk of my daily work.

Updated on September 18, 2022

Comments

  • terdon
    terdon over 1 year

    A friend pasted a command into a Slack chat room which contained the character *. This looks like a normal * but isn't:

    $ uniprops '*​'
    uniprops: no character named ‹*​›
    

    If I run uniprops on the asterisk I get by typing on my own machine, however, I get:

    $ uniprops '*'
    U+002A ‹*› \N{ASTERISK}
        \pP \p{Po}
        All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Common Zyyy Po P
           Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation
           Pat_Syn Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print
           X_POSIX_Print Punctuation Unicode X_POSIX_Punct
    

    I can also see that it isn't an actual asterisk by passing it through od:

    $ printf '*​' | od -c
    0000000   * 342 200 213
    0000004
    

    While the normal one gives:

    $ printf '*' | od -c
    0000000   *
    0000001
    

    Here's the mystery character a bit larger:

    *​

    And the normal asterisk (yes, they do look identical):

    *

    So, uniprops doesn't know what this is, and I can't find it on http://www.fileformat.info/ either. I do know that the friend who pasted it is on OS X (I am on Linux) and that it works on their system as a regular asterisk. I am assuming that Slack somehow changed it. So, does anyone have any idea what that character is?

    Note that you can't copy the weird character directly from the question. Apparently, the Stack Exchange engine strips the trailing non-printing characters. Click on the "edit" link and copy from there instead.


    uniprops is a neat little script included in the Unicode::Tussle Perl module which identifies and prints information about the character you give it.

    • March Ho
      March Ho almost 8 years
      Cannot reproduce. I used ord("*") for your pasted string and the native * key, and got the same number for both (42).
    • terdon
      terdon almost 8 years
      @MarchHo damn, the SE engine seems to be eating it. I tested before posting and could copy the strange character (although, I am starting to understand that the problem is that there were extra, non-printing characters added there) but I can't copy from the posted question either. You need to click on the edit link and copy from there.
    • derobert
      derobert almost 8 years
      Oddly, on the Android app, the zero width space is displayed as if it were a normal space.
    • bodo
      bodo almost 8 years
      Interestingly, when I paste from ‘edit’ into my terminal urxvt, it is already displayed as *<200b>.
    • TessellatingHeckler
      TessellatingHeckler almost 8 years
      If you copy it from your code section, e.g. the uniprops line, then it copies OK without needing to go to the question source. (Pasting it into Python3 interpreter shows as '*\u200b' too)
  • deltab
    deltab almost 8 years
    Another way to do that is printf '\342\200\213' | uniname. (uniname is from the uniutils package.)
  • deltab
    deltab almost 8 years
    Replacing str with hex will output the codepoints in hexadecimal, making them easier to recognise or look up.
  • bodo
    bodo almost 8 years
    There is also a dedicated python module called unicodedata, with which you can query character names, category etc.
  • Monty Harder
    Monty Harder almost 8 years
    The ZERO WIDTH SPACE and ZERO WIDTH JOINER characters are handy to use with comment systems that try to block common spam terms. For instance, to point out that Bernie Sanders was elected to the Senate as a Socialist (without tripping a spam trap for "Cialis") write it as "Soci&zwj;alist" if HTML Entities are respected, or paste in the character from Character Map or equivalent if they aren't.
  • Hastur
    Hastur almost 8 years
    From this site you can have different format conversions: for HEX it gives 002A 200B, for utf-8 2A E2 80 8B for utf-16 002A 200B...
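
The conversions listed in the last comment can also be reproduced locally. A minimal Python 3 sketch, where the function name encodings is hypothetical and UTF-16 is assumed to mean big-endian code units without a byte order mark:

```python
def encodings(text):
    """Return code points, UTF-8 bytes and UTF-16 code units as hex strings."""
    code_points = " ".join("{:04X}".format(ord(c)) for c in text)
    utf8 = " ".join("{:02X}".format(b) for b in text.encode("utf-8"))
    utf16_bytes = text.encode("utf-16-be")
    # Group the UTF-16 bytes into 16-bit code units.
    utf16 = " ".join(
        utf16_bytes[i:i + 2].hex().upper()
        for i in range(0, len(utf16_bytes), 2)
    )
    return code_points, utf8, utf16

# Asterisk followed by a zero width space, written as an escape.
print(encodings("*\u200b"))
```

For this input the code points and the UTF-16 code units coincide, because both characters lie in the Basic Multilingual Plane.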