How do I find this character (by Unicode search) in Notepad++ ﻁ (\uFEC1 and only that character)
Solution 1
Regarding searching by UTF-16 code
To search by Unicode codepoint (which, for a character in the Basic Multilingual Plane like this one, has the same value as its UTF-16 code unit), use \x{FEC1}. It works whether the file is encoded as UTF-8 or UTF-16.
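As a rough illustration of why the file encoding doesn't matter, here is a minimal Python sketch (not Notepad++ itself; the helper name `find_codepoint` is made up for this example). Once the bytes are decoded, the search runs over codepoints, so the same pattern finds the same positions either way:

```python
import re

TAH = "\uFEC1"  # ARABIC LETTER TAH ISOLATED FORM
AIN = "\uFEC9"  # ARABIC LETTER AIN ISOLATED FORM
sample = TAH + TAH + AIN + TAH + AIN + TAH + AIN

def find_codepoint(text, codepoint):
    """Return the indices at which the given codepoint occurs in text."""
    pattern = re.compile(re.escape(chr(codepoint)))
    return [m.start() for m in pattern.finditer(text)]

# Simulate the same document saved in two different encodings.
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")  # includes a BOM, removed on decode

# After decoding, the search sees identical codepoints in both cases.
from_utf8 = find_codepoint(utf8_bytes.decode("utf-8"), 0xFEC1)
from_utf16 = find_codepoint(utf16_bytes.decode("utf-16"), 0xFEC1)

print(from_utf8, from_utf16)  # [0, 1, 3, 5] [0, 1, 3, 5]
```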
Bear in mind that you don't need to search by the UTF-8 code, because you can search by the UTF-16 code. But to address the part of your question that asks how to search for that character by its UTF-8 code...
Regarding searching by UTF-8 code
You can't. Well, you sort of can, but it's a hideous hack and you really shouldn't.
The obvious thing to try would be to search for \xef\xbb\x81 in your UTF-8 encoded document, but that doesn't work. (Note there's no {} here: Notepad++ expects either \xNN for 2 hex digits, or \x{NNNN} for 4 hex digits.) That's because Notepad++ doesn't actually search for byte values; it searches for Unicode codepoints. So you can search for the codepoint U+FEC1, but not for the UTF-8 bytes 0xEF 0xBB 0x81, because Notepad++ "hides" the encoding details from you. (In nearly every scenario, someone editing a text file cares far more about finding the actual character than about finding the UTF-8 bytes.)
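The difference between searching bytes and searching codepoints can be sketched in Python, assuming the editor behaves like a search over decoded text (that is my reading of the behavior described above, not something taken from Notepad++'s source). Read as codepoints, \xef\xbb\x81 means the three characters U+00EF, U+00BB and U+0081, none of which is in the document:

```python
TAH = "\uFEC1"

# How the single codepoint U+FEC1 is stored in each encoding.
utf8 = TAH.encode("utf-8")         # b'\xef\xbb\x81'
utf16be = TAH.encode("utf-16-be")  # b'\xfe\xc1'

# Interpreted as codepoints, "\xef\xbb\x81" is three separate characters:
# U+00EF, U+00BB and U+0081. The decoded document contains none of them.
as_codepoints = "\xef\xbb\x81"
found_as_codepoints = as_codepoints in TAH

# A byte-level search over the raw UTF-8 data does find the sequence.
found_as_bytes = b"\xef\xbb\x81" in utf8

print(found_as_codepoints, found_as_bytes)  # False True
```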
There's another trick you might try, which is to take that UTF-8 encoded file and choose the Encoding → Encode in ANSI menu option, at which point ﻁﻁﻉﻁﻉﻁﻉ appears to become ï»ï»ï»‰ï»ï»‰ï»ï»‰. (I say "appears to become" rather than "becomes" because... well, read on.) This is because Notepad++ has taken the UTF-8 bytes of your file and reinterpreted them as "ANSI" (which is a terrible encoding name because it's completely wrong, and should really be called "Windows-1252", but that's a different question).
(By the way, the reason ﻁﻁﻉﻁﻉﻁﻉ looks reversed in my text compared with your screenshot is that Notepad++ doesn't care that Arabic is written right-to-left, so it shows the characters left-to-right in the order they were pasted into the file. Your browser does care about presenting Arabic in proper right-to-left order, so the first two letters of that string (ﻁﻁ) appear on the right-hand side of the string, not on the left-hand side as they seem to in Notepad++.)
Digressions aside, here's why this is helpful. In the "ANSI" (really Windows-1252) encoding, each byte is a single character, so now you can search by individual bytes. If you search for \xef\xbb\x81 (which doesn't need to be a regular expression, just an "Extended" search), it will find the characters. Sort of. It will look like it's highlighting the two characters ï», but it's really highlighting three characters: ï, », and an "invisible" 0x81 character that doesn't really exist. (There is no character at the 0x81 point in the Windows-1252 encoding: see for yourself.) And now you see why I said "appears to become": your UTF-8 encoded text has really become ï»_ï»_ï»‰ï»_ï»‰ï»_ï»‰, where _ represents an "invisible" character that doesn't officially exist in the Windows-1252 codepage.
Anyway, now that you've found the sequence of three characters with the byte values 0xEF, 0xBB, and 0x81 in Windows-1252, and Notepad++ has highlighted them, you can choose the Encoding → Encode in UTF-8 menu option. Your text will convert itself back to UTF-8 while Notepad++ keeps the highlight in the same place, and thus you'll find that one ﻁ character has been highlighted.
So why do I say that you really shouldn't do this? Because the only reason it works is that Notepad++ didn't do the right thing when you switched codepages. The right thing to do when you find a missing character is to complain, or insert a character like the Unicode replacement character � (or a simple ? if you're in a legacy codepage that doesn't have � in it), or do something so that the user will know they had an invalid character in their text. Errors should never be silently ignored, and having a 0x81 value in Windows-1252 text is an error. The only reason this trick works is that Notepad++ does the wrong thing with invalid characters (that is, it ignores them). So you really shouldn't rely on this trick: with any update, Notepad++ could change its undocumented (and wrong) behavior and start putting proper replacement characters in wrongly-encoded text, at which point this trick would fail. Stick to searching for real Unicode codepoints, and you'll be much better off.
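For comparison, this is roughly what "complain, or insert a replacement character" looks like in practice. The sketch uses Python's built-in cp1252 codec, which leaves byte 0x81 unmapped; it shows how one strict decoder behaves, not how Notepad++ is implemented:

```python
raw = "\uFEC1".encode("utf-8")  # the bytes 0xEF 0xBB 0x81

# Strict decoding "complains": 0x81 has no mapping in Python's cp1252 table.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding marks the bad byte with U+FFFD instead of hiding it.
replaced = raw.decode("cp1252", errors="replace")

print(strict_failed)                 # True
print([hex(ord(c)) for c in replaced])  # ['0xef', '0xbb', '0xfffd']
```

Either way the invalid byte is surfaced to the user rather than silently carried along, which is exactly the behavior that would break the round-trip trick.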
By the way, the reason your original attempt ([\uFEC1]) failed is that, according to Notepad++'s regular expression syntax, \u means "an uppercase letter". (Remember that in regular expressions, brackets mean "any of these characters".) The docs further say, "See note about lower case [sic] letters," and the note about lowercase letters says "this will fall back on "a word character" if the "Match case" search option is off", as it is in your screenshot. Therefore, the regex [\uFEC1] is searching for "any word character, or F, or E, or C, or 1", which matches every single character in your sample text.
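The effect can be emulated in Python. Note the assumption: Python's own `re` module treats `\uFEC1` as the codepoint, so the Notepad++ reading described above is imitated here by spelling the class out as `[\wFEC1]`:

```python
import re

sample = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

# Emulation of how Notepad++ reads [\uFEC1] with "Match case" off:
# a word character, or one of the literal characters F, E, C, 1.
broad = re.findall(r"[\wFEC1]", sample)

# Searching for the actual codepoint (what \x{FEC1} does in Notepad++).
narrow = re.findall("\uFEC1", sample)

print(len(broad), len(narrow))  # 7 4
```

Both Arabic letters count as word characters, so the broad class matches all seven characters, while the codepoint search matches only the four occurrences of U+FEC1.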
Phew, that turned out to be a very long answer for what I said would be "very simple". I hope this helps you understand Unicode a bit better; if so, the hour I spent typing this up will have been worth it.
Solution 2
Take a look: Anyone know how to use Regex in notepad++ to find Arabic characters?
Notepad++'s implementation of regular expressions requires that you use the \x{NNNN} notation to match Unicode characters. In your example, that is \x{FEC1}.
barlop
Updated on September 18, 2022
Comments
-
barlop over 1 year
How do I find this character (by Unicode search) in Notepad++ ﻁ
If I go to charmap
and I pick this character
I type FEC1 in the unicode search box and hit ENTER and it finds the character
I look it up on fileformat.info
http://www.fileformat.info/info/unicode/char/fec1/index.htm
UTF-8 (hex) 0xEF 0xBB 0x81 (efbb81) UTF-16 (hex) 0xFEC1 (fec1)
If I enter the character into the search box literally then it finds it
But I can't see what unicode to search for to find it
I'd like to be able to search for it in both UTF-8 and UTF-16
[\uFEC1] seems to find the character, but it finds more than that character
Now, if I throw a few FEC9s in there, then I see [\uFEC1] seems to find them too
So, how do I search for \uFEC1 and only that? I'm also interested in searching for it by its UTF-8 code.
-
barlop, over 8 years ago: Thanks, that works for UTF-16. Do you know if you can search with the UTF-8 code (that's the other part of my question)?
-
barlop, over 8 years ago: I didn't ask you if it works or doesn't work on your system. Notepad++ is Notepad++, so anything will work or not work for both of us. What I asked you (and it's in my question too) is whether you can search with the UTF-8 code for that character, that is,
UTF-8 (hex) 0xEF 0xBB 0x81 (efbb81)
-
barlop, over 8 years ago: -1. I'll have to downvote you for not grasping this and not even understanding that you haven't understood the second half of the question. I've been pretty clear that I was also asking about UTF-8, and you haven't understood or made any effort to even see that you do not understand. It's one thing to not understand something and to ask; it's another thing to not understand it and be completely oblivious to not understanding it. You insist that you answered my question, but I told you, there are two parts to it and you did not answer the second half (despite your insistence that you have).
-
Leo Chapiro, over 8 years ago: I don't care about you downvoting; I just wanted to help you!
-
barlop, over 8 years ago: [If you are answering (which you did) THEN] you should just be trying to answer the question, and honestly.
-
barlop, over 3 years ago: You write "That's because Notepad++ doesn't actually search for byte values, it searches for Unicode codepoints". But text editors aside, Unicode codepoints can be represented in UTF-8 or UTF-16. So the idea of searching for something using the UTF-8 encoded value does not necessarily mean searching for bytes as stored, just like searching for something using the UTF-16 value doesn't necessarily mean searching for bytes as stored.