How do I escape a Unicode string with Ruby?


Solution 1

In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.

>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"

>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""

>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil

In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:

>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""

In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):

>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
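
If escaping the printable ASCII characters as well is not what you want, a variation along the following lines may help. This is only a sketch for Ruby 1.9 or later: escape_non_ascii is a made-up helper name, and it assumes the input is a valid UTF-8 string. It also switches to the braced form for code points above U+FFFF, which the four-digit form cannot express.

```ruby
# Hypothetical helper: leave ASCII alone, escape everything else.
def escape_non_ascii(str)
  str.each_char.map do |ch|
    cp = ch.ord
    if cp < 0x80
      ch                       # plain ASCII stays as it is
    elsif cp <= 0xFFFF
      format('\u%04x', cp)     # four-digit form, e.g. \u0639
    else
      format('\u{%x}', cp)     # braced form for code points above U+FFFF
    end
  end.join
end

puts escape_non_ascii("hello\u0639!")   # prints hello\u0639!
puts escape_non_ascii("\u{1F61E}")      # prints \u{1f61e}
```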

Solution 2

To use a Unicode character in a Ruby string literal, use the "\uXXXX" escape, where XXXX is the Unicode code point written as exactly four hexadecimal digits.
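
A small sketch of how that escape behaves, including the pitfall with code points that need more than four hex digits. The variable names are just for illustration.

```ruby
micro = "\u00b5"            # four hex digits: U+00B5 MICRO SIGN
puts micro                  # prints µ
puts micro.ord.to_s(16)     # prints b5

# \u consumes exactly four hex digits, so a fifth digit becomes a separate character.
wrong = "\u1f92e"           # U+1F92 followed by the letter "e"
right = "\u{1f92e}"         # U+1F92E as a single code point
puts wrong.length           # prints 2
puts right.length           # prints 1
```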

Solution 3

If you have Rails kicking around you can use the JSON encoder for this:

require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"

The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
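
That is the default behaviour, at least. As far as I know, the json library that ships with Ruby accepts an ascii_only generator option that asks for the escaping, so something like the sketch below may be enough without Rails. The value is wrapped in an array here only to keep the example independent of how a given json version treats top-level scalars.

```ruby
require 'json'

# ascii_only: true asks the generator to write non-ASCII characters as \uXXXX escapes.
escaped = JSON.generate(['µ'], ascii_only: true)
puts escaped                    # a JSON array whose string uses a \u escape for µ
puts escaped.ascii_only?        # prints true

# Parsing the escaped text gives the original character back.
puts JSON.parse(escaped).first  # prints µ
```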

Solution 4

There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.

Finding the value:

Method 1a: from Ruby with String#dump:

If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):

s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.

... should print out:

"\u20AC\u00A3\u00A5$\n"

Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)

(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
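
To make the difference concrete, here is a short sketch comparing the two methods on that same character. What #inspect prints depends on your encoding settings, so no output is claimed for it; the last line assumes Ruby 2.5 or later, where String#undump is available.

```ruby
sad = "\u{1F61E}"

puts sad.inspect              # may show the character itself, depending on encodings
puts sad.dump                 # prints "\u{1F61E}"

# undump reverses dump, turning the escaped text back into the original string.
puts sad.dump.undump == sad   # prints true
```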

Method 1b: with Ruby using String#encode and rescue:

Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:

encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError # but if that fails
    encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end

With the same input as above, this would then print:

€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".

Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:

🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
̆ encodes to "\u0306".

This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
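
As a starting point for that exercise, the sketch below shows one way to compare the two spellings of ў and to count user-perceived characters instead of code points. It assumes Ruby 2.5 or later (String#unicode_normalize arrived in 2.2, String#grapheme_clusters in 2.5), and how emoji sequences are grouped can vary between Ruby versions, so no count is claimed for the last line.

```ruby
precomposed = "\u045E"          # ў as a single code point
decomposed  = "\u0443\u0306"    # у followed by COMBINING BREVE

puts precomposed == decomposed                           # prints false
puts precomposed == decomposed.unicode_normalize(:nfc)   # prints true

waving = "\u{1F64B 1F3FE}"      # base emoji plus skin-tone modifier
puts waving.codepoints.length           # prints 2
puts waving.grapheme_clusters.length    # number of user-perceived characters
```

Normalizing to NFC before the loop above would report ў once instead of splitting the decomposed form into two entries.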

Method 2a: from web-based tools: specific characters:

Alternatively, if you have, say, an e-mail with a character in it and you want to find the code point value to encode, a web search for that character will frequently turn up a variety of pages that give Unicode details for it. For example, if I do a Google search for ✓, I get, among other things, a Wiktionary entry, a Wikipedia page, and a page on a character-reference site that I find useful for getting details on specific Unicode characters. Each of those pages states that this check mark is represented by Unicode code point U+2713. (Incidentally, searching in the other direction, for U+2713, works well too.)

Method 2b: from web-based tools: by name/concept:

Similarly, one can search for Unicode symbols to match a particular concept. For example, I searched for "unicode check marks", and even the Google snippet listed several code points with corresponding graphics. I also found a list of several check mark symbols, and even a "list of useful symbols" that has a bunch of things, including various check marks.

This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:

Representing the value, once you have it:

The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:

\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])

\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])

So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).

This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
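
A quick sketch to check those claims, plus the run-time alternative mentioned in the comments (Array#pack with "U*") for when the code point is only known as a number or hex string. The variable names are just for illustration.

```ruby
currency = "\u{20AC A3 A5 24}"      # several code points in one escape
puts currency                        # prints €£¥$
puts currency == "\u20AC\u00A3\u00A5\u0024"   # prints true

# Building a character from a code point at run time.
check = [0x2713].pack("U*")
puts check                           # prints ✓
puts check == "\u2713"               # prints true

# The same thing starting from a hex string; it is not limited to four digits.
puts ["1f60d".to_i(16)].pack("U*") == "\u{1f60d}"   # prints true
```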

Solution 5

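You can use Unicode characters directly if you add the magic comment # encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code. (Since Ruby 2.0, UTF-8 is the default source encoding, so the comment is only needed on 1.9.)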

Updated on April 07, 2020


  • Kadarach about 4 years

    I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?

  • Dave about 12 years
    For Ruby 1.8, you can use ["XXXX".to_i(16)].pack("U*")
  • Steve Benner almost 11 years
    This was super helpful! I was about to write it myself, and it saved me time, elegant composition sir. I used this to encode some hints for a little CSS tutorial I made on Codepen, so they aren't visible to the user until being parsed into JSON! check it out!
  • David Makogon almost 7 years
    This is your own repo (or one which you're a primary contributor to). Which you didn't disclose. So basically it's spam. And it doesn't answer the question.
  • lindes about 5 years
    @Trejkaz: I had the same question. The linked document actually shows an example of it: use { and } around the code, e.g. \u{1f60d} kind of expresses how I felt about figuring out how to express these things. :D
  • lindes about 5 years
    Upvoted this answer because pieces of it were helpful to me, but it may be worth pointing out that #inspect doesn't always give you what you need. #dump should do the trick, though. See also a new answer that I somehow felt inspired to write.
  • Hakanai about 5 years
    @lindes yeah, also worth noting that the pack function also works for ["1f60d".to_i(16)].pack("U*"). It isn't immediately obvious that it would. :)
  • lindes about 5 years
    Ah, yes. So it does. Makes sense, since it's just getting integers. So the key here really is that the "XXXX" in @Dave's comment is not constrained to being 4 digits (could be fewer or more), even while the one in this answer is.
  • Shelvacu over 4 years
    This breaks on codepoints above U+FFFF, like U+1F92E 🤮, the syntax should be "\u{1f92e}" not "\u1f92e" (which gives ᾒe)