What is this character: '*'?
Solution 1
The paste failed not because of the asterisk, which is a perfectly regular asterisk, but because of the Unicode character U+200B. As the character is a ZERO WIDTH SPACE, it does not display when it is copied.
Using the Python code:
stro = u"'*'?"
def uniconv(text):
    # format the code point of every character as a hexadecimal number
    return " ".join(hex(ord(char)) for char in text)
print(uniconv(stro))
The function uniconv converts each character of the input string (in this case, u"'*'?") into its Unicode code point in hexadecimal format. The u prefix marks the string as a Unicode string; it is required in Python 2 and optional in Python 3, where every str is already Unicode.
I was able to obtain the output:
0x27 0x2a 0x200b 0x27 0x3f
We can clearly see that 0x27, 0x2a and 0x3f are the ASCII/Unicode hexadecimal values for the characters ', * and ? respectively. That leaves 0x200b, which identifies the mystery character.
Note that the Python code, when pasted into the body, had the U+200B character removed by SE's Markdown software. In order to obtain the expected result, you need to copy it directly from the title using the Edit view.
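Because the invisible character is so easily lost when copying, a variant that spells it out as an escape sequence may be easier to reproduce. This is a minimal sketch for Python 3, assuming the title text is an apostrophe, an asterisk, U+200B, an apostrophe and a question mark (as the output above suggests); the call to unicodedata.name is an addition of this sketch, not part of the original code.

```python
import unicodedata

# Assumed stand-in for the title text: the zero width space is written as
# the escape \u200b so that it cannot be stripped when the code is copied.
stro = "'*\u200b'?"

def uniconv(text):
    """Return the code points of text as space-separated hex values."""
    return " ".join(hex(ord(char)) for char in text)

print(uniconv(stro))               # 0x27 0x2a 0x200b 0x27 0x3f
print(unicodedata.name("\u200b"))  # ZERO WIDTH SPACE
```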
Solution 2
With the help of @Rinzwind in the Ask Ubuntu chat room, I figured out that the problem isn't the character at all. Note the output of od:
$ printf '*' | od -c
0000000 * 342 200 213
0000004
The 342 200 213 is the octal representation of the UTF-8 bytes of another character, and we can use this site to look it up:
Character
Character name: ZERO WIDTH SPACE
Hex code point: 200B
Decimal code point: 8203
Hex UTF-8 bytes: E2 80 8B
Octal UTF-8 bytes: 342 200 213
UTF-8 bytes as Latin-1 characters: â <80> <8B>
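The same lookup can probably be done locally instead of on a website. The following is a rough sketch, assuming Python 3; the helper name decode_octal_utf8 is made up for this illustration. It turns the octal byte values printed by od back into bytes, decodes them as UTF-8 and asks the standard unicodedata module for the character name.

```python
import unicodedata

def decode_octal_utf8(octal_bytes):
    """Decode a string of space-separated octal byte values as UTF-8."""
    raw = bytes(int(token, 8) for token in octal_bytes.split())
    return raw.decode("utf-8")

# The three octal values that od printed after the asterisk.
text = decode_octal_utf8("342 200 213")

for ch in text:
    # Print the code point and its Unicode name, if it has one.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

For these three bytes the loop prints a single line, U+200B ZERO WIDTH SPACE, matching the rows listed above.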
So, what I actually had was two Unicode characters: the normal * and a zero width space.
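If the aim is simply to make a pasted command usable again, one option is to look for invisible format characters and drop them. This is only a sketch under assumptions: it treats every character in Unicode general category Cf (which includes U+200B) as unwanted, the function names are invented here, and the sample command is hypothetical. Note that Cf also covers characters such as ZERO WIDTH JOINER that can be legitimate in ordinary text, so this is not a safe filter for prose.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point, name) for each category-Cf character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

def strip_format_chars(text):
    """Return text with all category-Cf characters removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Hypothetical pasted command with a zero width space after the asterisk.
pasted = "ls *\u200b.txt"

print(find_format_chars(pasted))   # [(4, 'U+200B', 'ZERO WIDTH SPACE')]
print(strip_format_chars(pasted))  # ls *.txt
```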
terdon
Updated on September 18, 2022
Comments
-
terdon over 1 year
A friend pasted a command into a Slack chat room which contained the character *. This looks like a normal * but isn't:
$ uniprops '*'
uniprops: no character named ‹*›
While if I run uniprops on the asterisk I get when typing on my machine, I get:
$ uniprops '*'
U+002A ‹*› \N{ASTERISK}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Common Zyyy Po P Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation Unicode X_POSIX_Punct
I can also see that it isn't an actual asterisk by passing it through od:
$ printf '*' | od -c
0000000 * 342 200 213
0000004
While the normal one gives:
$ printf '*' | od -c
0000000 * 0000001
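For anyone without od at hand, the comparison can be approximated in Python 3. This is a rough stand-in, not a reimplementation of od: the helper octal_dump is a name chosen for this sketch, the pasted text is assumed to be an asterisk followed by U+200B, and every byte is shown in octal, whereas od -c prints printable bytes as characters.

```python
def octal_dump(text):
    """Return the UTF-8 bytes of text as space-separated octal values."""
    return " ".join(f"{byte:03o}" for byte in text.encode("utf-8"))

# Assumed form of the pasted text: an asterisk followed by U+200B.
print(octal_dump("*\u200b"))  # 052 342 200 213
# The asterisk typed on a keyboard.
print(octal_dump("*"))        # 052
```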
Here's the mystery character a bit larger:
*
And the normal asterisk (yes, they do look identical):
*
So, uniprops doesn't know what this is, and I can't find it on http://www.fileformat.info/ either. I do know that the friend who pasted it is on OS X (I am on Linux) and that it works on their system as a regular asterisk. I am assuming that Slack somehow changed it. So, does anyone have any idea what that character is?
Note that you can't copy the weird character directly from the question. Apparently, the Stack Exchange engine strips the trailing non-printing characters. Click on the "edit" link and copy from there instead.
uniprops is a neat little script included in the Unicode::Tussle Perl module which identifies and prints information about the character you give it.
-
March Ho almost 8 years
Cannot reproduce. I used ord("*") for your pasted string and the native * key, and got the same number for both (42).
-
terdon almost 8 years
@MarchHo damn, the SE engine seems to be eating it. I tested before posting and could copy the strange character (although I am starting to understand that the problem is that there were extra, non-printing characters added there) but I can't copy from the posted question either. You need to click on the edit link and copy from there.
-
derobert almost 8 years
Oddly, on the Android app, the zero width space is displayed as if it were a normal space.
-
bodo almost 8 years
Interestingly, when I paste from ‘edit’ into my terminal urxvt, it is already displayed as *<200b>.
-
TessellatingHeckler almost 8 years
If you copy it from your code section, e.g. the uniprops line, then it copies OK without needing to go to the question source. (Pasting it into a Python 3 interpreter shows it as '*\u200b' too.)
-
-
deltab almost 8 years
Another way to do that is printf '\342\200\213' | uniname. (uniname is from the uniutils package.)
-
deltab almost 8 years
Replacing str with hex will output the codepoints in hexadecimal, making them easier to recognise or look up.
-
bodo almost 8 years
There is also a dedicated Python module called unicodedata, with which you can query character names, categories etc.
-
Monty Harder almost 8 years
The ZERO WIDTH SPACE and ZERO WIDTH JOINER characters are handy to use with comment systems that try to block common spam terms. For instance, to point out that Bernie Sanders was elected to the Senate as a Socialist (without tripping a spam trap for "Cialis") write it as "Soci‍alist" if HTML Entities are respected, or paste in the character from Character Map or equivalent if they aren't.
-
Hastur almost 8 years
From this site you can have different format conversions: for HEX it gives 002A 200B, for UTF-8 2A E2 80 8B, for UTF-16 002A 200B ...