How to convert UTF8 byte arrays to string in lua


Solution 1

Lua doesn't provide a direct function for turning a table of UTF-8 bytes in numeric form into a string. But it's easy enough to write one with the help of string.char:

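function utf8_from(t)
  local bytearr = {}
  for _, v in ipairs(t) do
    -- Map signed bytes (-128..-1) to their unsigned value (128..255).
    local utf8byte = v < 0 and (0xff + v + 1) or v
    table.insert(bytearr, string.char(utf8byte))
  end
  -- Join the single-byte strings into one string.
  return table.concat(bytearr)
end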

Note that Lua's standard string functions work on bytes and are not UTF-8 aware (Lua 5.3 and later add a small utf8 library, but string.len, string.sub and the like still count bytes). Whether the string returned by the function above prints correctly depends on the terminal: if it does not expect UTF-8, you'll just see some funky symbols. If you need more extensive UTF-8 support, check out some of the libraries mentioned on the Lua wiki.
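As a quick check, here is a minimal sketch that applies the same idea to the sample table from the question. The helper name bytes_to_string is hypothetical, and it uses v % 256 rather than the conditional to normalize negative values (Lua's modulo always returns a non-negative result for a positive divisor). The # operator reports the length in bytes, not characters.

```lua
-- Hypothetical helper: same approach as above, with modulo normalization.
local function bytes_to_string(t)
  local parts = {}
  for i, v in ipairs(t) do
    parts[i] = string.char(v % 256)  -- e.g. -25 % 256 == 231
  end
  return table.concat(parts)
end

-- Sample table from the question (it mixes signed and unsigned values).
local bytes = {57, 55, 0, 15, -25, 139, 130, -23, 173, 148, -24, 136, 158}
local s = bytes_to_string(bytes)

print(#s)        -- length in bytes
print(s:sub(5))  -- the last nine bytes are three 3-byte UTF-8 sequences
```

The first four bytes (57, 55, 0, 15) are not printable text beyond the digits "97", so only the tail of the string is likely to display as readable characters.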

Solution 2

Here's a more comprehensive solution that builds a UTF-8 string from Unicode code points (not from raw bytes), covering the sequences of up to four bytes that RFC 3629 allows:

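do
  -- Upper code point limit and lead-byte marker for 2-, 3- and 4-byte sequences.
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  -- Named utf8char rather than utf8 so it does not replace the utf8 library of Lua 5.3+.
  function utf8char(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        -- Fill the continuation bytes from last to first, 6 bits at a time.
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        -- The remaining high bits go into the lead byte.
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end

function utf8frompoints(...)
  local chars,arg={},{...}
  for i,n in ipairs(arg) do chars[i]=utf8char(n) end
  return table.concat(chars)
end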

print(utf8frompoints(72, 233, 108, 108, 246, 32, 8364, 8212))
--> Héllö €—
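
For comparison, Lua 5.3 and later ship a standard utf8 library whose utf8.char performs the same code point to string conversion. The sketch below is guarded so that it also runs on older versions, where it simply reports that the library is missing; the variable names are illustrative only.

```lua
-- utf8.char and utf8.len are part of the standard library from Lua 5.3 onward.
local has_utf8_lib = type(utf8) == "table" and type(utf8.char) == "function"

local greeting
if has_utf8_lib then
  greeting = utf8.char(72, 233, 108, 108, 246, 32, 8364, 8212)
  print(greeting)            -- readable if the terminal expects UTF-8
  print(#greeting)           -- length in bytes
  print(utf8.len(greeting))  -- number of code points
else
  print("no utf8 library in this Lua version")
end
```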
Author: Tony

Updated on June 19, 2022

Comments

  • Tony
    Tony almost 2 years

    I have a table like this:

    table = {57,55,0,15,-25,139,130,-23,173,148,-24,136,158}

    It is a UTF-8 encoded byte array produced by PHP's unpack function:

    unpack('C*',$str);

    How can I convert it to a UTF-8 string that I can read in Lua?

  • Phrogz
    Phrogz over 9 years
    I've just replaced the old implementation with one that is far more elegant (uses no strings for the binary math), shorter, and consequently about 5 times faster, too.
  • Phrogz
    Phrogz over 9 years
    -1: does not handle 3- and 4-byte UTF8 characters like U+20AC -> €
  • Phrogz
    Phrogz over 9 years
    Additional optimizations (edited into the above) provide another 2x or more perf gains.
  • Алекс Денькин
    Алекс Денькин about 5 years
    How do I use this function with a string like s="\xD0\x9C\xD0\xBE\xD1\x81\xD0\xBA\xD0\xB2\xD0\xB0"?
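
On the last comment: a string written with \x escapes (supported from Lua 5.2) already holds the raw UTF-8 bytes, so it probably needs no conversion at all. If a byte table is wanted, string.byte can produce one. A minimal sketch, written with decimal escapes so that it also runs on Lua 5.1; the variable names are illustrative:

```lua
-- Same bytes as "\xD0\x9C\xD0\xBE\xD1\x81\xD0\xBA\xD0\xB2\xD0\xB0",
-- written with decimal escapes for portability.
local s = "\208\156\208\190\209\129\208\186\208\178\208\176"

-- string.byte with a range returns every byte as a separate number.
local bytes = { s:byte(1, -1) }
print(#bytes)  -- number of bytes in the string

-- Going back: string.char accepts the numbers directly.
local unpack = table.unpack or unpack  -- Lua 5.2+ vs. Lua 5.1
local roundtrip = string.char(unpack(bytes))
print(roundtrip == s)
```

Passing a table to string.char through unpack is fine for short strings like this one; for very long byte tables a loop with table.concat avoids hitting the limit on function arguments.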