Convert unicode codepoint to string character in Ruby

ruby string unicode utf-8

Solution 1

How about:

# Using pack
puts ["2B71F".hex].pack("U")

# Using chr
puts 0x2B71F.chr(Encoding::UTF_8)

In Ruby 1.9+ you can also do:

puts "\u{2B71F}"

That is, the \u{...} escape sequence inserts the character for a given hexadecimal codepoint directly into a string literal.
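As a quick check, the three approaches above should all yield the same one-character UTF-8 string. This is a minimal sketch; the variable names are only for illustration:

```ruby
# Build the same character three ways.
from_pack   = [0x2B71F].pack("U")              # codepoint -> UTF-8 string via pack
from_chr    = 0x2B71F.chr(Encoding::UTF_8)     # Integer#chr with an explicit encoding
from_escape = "\u{2B71F}"                      # escape sequence in a string literal

puts from_pack == from_chr && from_chr == from_escape  # true
puts from_escape.encoding                              # UTF-8
puts from_escape.ord.to_s(16).upcase                   # 2B71F
```

String#ord goes the other way, from a character back to its codepoint.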

Solution 2

Values like U+2B71F are called codepoints.

The Unicode standard assigns a unique codepoint to each character across a multitude of the world's languages, scientific symbols, currencies, etc. This character set is steadily growing.

For example, U+221E is infinity.

Codepoints are conventionally written as hexadecimal numbers. There is always exactly one number defined per character.

There are many ways to lay a codepoint out as bytes in memory. Each such layout is called an encoding; the common ones are UTF-8 and UTF-16. Conversion between them is well defined.
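To make the codepoint/encoding distinction concrete, here is a small sketch that takes U+221E and looks at its bytes in two encodings. The variable names are illustrative only:

```ruby
# U+221E (infinity) as a one-character string; "\u{...}" literals are UTF-8.
char = "\u{221E}"

# Same codepoint, two encodings, two different byte sequences.
utf8_bytes  = char.bytes
utf16_bytes = char.encode(Encoding::UTF_16BE).bytes

puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")   # E2 88 9E
puts utf16_bytes.map { |b| format("%02X", b) }.join(" ")  # 22 1E

# Converting back gives the original character.
round_trip = char.encode(Encoding::UTF_16BE).encode(Encoding::UTF_8)
puts round_trip == char                                    # true
```

The codepoint stays 0x221E throughout; only the byte representation changes.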

Here you most probably want to convert the Unicode codepoint to a UTF-8 encoded character.

codepoint = "U+2B71F"

You need to extract the hex part that comes after U+, leaving only 2B71F. It will be the first capture group of the regex.

codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/

And your UTF-8 character will be:

utf_8_character = [$1.hex].pack("U")
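Putting the two steps together, something like the following could convert the whole list from the question. This is only a sketch: codepoint_to_char is a hypothetical helper name, and it uses a slightly stricter, fully anchored pattern than the regex above, returning nil for anything that does not look like a codepoint label.

```ruby
# Hypothetical helper: turn a label such as "U+2B71F" into the character it names.
# Returns nil when the label is malformed or beyond the Unicode range.
def codepoint_to_char(label)
  match = /\AU\+([0-9a-fA-F]{4,6})\z/.match(label)
  return nil unless match

  value = match[1].hex
  return nil if value > 0x10FFFF  # highest codepoint Unicode defines

  [value].pack("U")
end

labels = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
chars  = labels.map { |label| codepoint_to_char(label) }

puts chars.join(" ")
```

Whether the characters display properly depends on your terminal and fonts, since these codepoints are rare CJK ideographs.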

Author by

thenengah

Updated on June 06, 2022

Comments

thenengah about 2 years
I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?

Here they are:
- U+2B71F
- U+2A52D
- U+2A68F
- U+2A690
- U+2B72F
- U+2B4F7
- U+2B72B
How can I convert these to their readable symbols?
Ocaj Nires almost 13 years

codepoint was "U+2B71F". To extract just "2B71F" from it, I match it against a unicode regex. There is one group defined in the regex for extracting "2B71F". After the match, if there is one you can refer to it with $1 in this case. Follow this rubular permalink to see the regex in action.
AJP over 11 years

One of the best answers regarding unicode, utf-8 code points, character sets, encoding etc I have ever read on SO... and the links are brilliant. joelonsoftware.com/articles/Unicode.html is particularly spot on.
Andrew Marshall over 11 years

You could also just use a hex literal: [0x2B71F].pack 'U'.