String to byte array in UTF-8?

Solution 1

A function like this will do what you need:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s));
  if Length(Result)>0 then
    Move(s[1], Result[0], Length(s));
end;

You can call it with any type of string, and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be tricked into thinking you must convert to UTF-8 before calling; just pass in any string and let the RTL do the work.

After that it's a fairly standard array copy. Note the assertion that explicitly calls out the assumption on string element size for a UTF-8 encoded string.

If you want to include the zero-terminator, you would write it like this:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s)+1);
  if Length(s)>0 then
    Move(s[1], Result[0], Length(s));
  Result[high(Result)] := 0;
end;

Solution 2

You can use TEncoding.UTF8.GetBytes, declared in SysUtils.pas.
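
A minimal sketch of that call, assuming Delphi 2009 or later (where TEncoding is available in SysUtils). The program name and the sample text are only illustrative:

```pascal
program Utf8GetBytesDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  B: TBytes;
begin
  // GetBytes takes a UnicodeString and returns its UTF-8 encoding.
  // The result has no BOM and no terminating zero byte.
  B := TEncoding.UTF8.GetBytes('Șase sași în șase saci');
  Writeln(Length(B)); // number of bytes, not number of characters
end.
```

If the receiving side expects a terminator, you would have to append the zero byte yourself.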

Solution 3

If you're using Delphi 2009 or later (the Unicode versions), converting a WideString to a UTF8String is a simple assignment statement:

var
  ws: WideString;
  u8s: UTF8String;

u8s := ws;

The compiler will call the right library function to do the conversion because it knows that values of type UTF8String have a "code page" of CP_UTF8.

In Delphi 7 and later, you can use the provided library function Utf8Encode. For even earlier versions, you can get that function from other libraries, such as the JCL.
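
As a rough sketch, calling Utf8Encode looks like this. The program name and sample text are made up for illustration; the function itself lives in the System unit, so no extra uses clause should be needed:

```pascal
program Utf8EncodeDemo; // hypothetical program name

{$APPTYPE CONSOLE}

var
  ws: WideString;
  u8s: UTF8String;
begin
  ws := 'some text';
  // Explicit library call instead of relying on the compiler's
  // implicit conversion on assignment.
  u8s := Utf8Encode(ws);
  Writeln(Length(u8s)); // length in bytes of the UTF-8 form
end.
```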

You can also write your own conversion function using the Windows API:

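// Requires Windows (WideCharToMultiByte, CP_UTF8) and SysUtils (Win32Check).
function CustomUtf8Encode(const ws: WideString): UTF8String;
var
  n: Integer;
begin
  // WideCharToMultiByte fails when the input length is 0, so handle '' up front.
  Result := '';
  if ws = '' then
    Exit;
  // First call: ask how many bytes the UTF-8 encoding needs.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);
  SetLength(Result, n);
  // Second call: convert into the buffer.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), PAnsiChar(Result), n, nil, nil);
  Win32Check(n = Length(Result));
end;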

A lot of the time, you can simply use a UTF8String as an array, but if you really need a byte array, you can use David's and Cosmin's functions (Solutions 1 and 4 on this page). If you're writing your own character-conversion function, you can skip the UTF8String and go directly to a byte array; just change the return type to TBytes or array of Byte. (You may also wish to increase the length by one if you want the array to be null-terminated. SetLength will do that to the string implicitly, but not to an array.)
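
One way the "directly to a byte array" variant might look is sketched below. The name CustomUtf8Bytes is hypothetical, the approach is the same two-call pattern as CustomUtf8Encode, and the empty-string check is an addition of this sketch:

```pascal
uses
  Windows, SysUtils;

// Hypothetical name; same approach as CustomUtf8Encode, but returning TBytes.
function CustomUtf8Bytes(const ws: WideString): TBytes;
var
  n: Integer;
begin
  Result := nil;
  if ws = '' then
    Exit; // WideCharToMultiByte fails when asked to convert zero characters
  // First call: query the required size in bytes.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);
  SetLength(Result, n);
  // Second call: write the UTF-8 bytes straight into the array.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws),
    PAnsiChar(@Result[0]), n, nil, nil);
  Win32Check(n = Length(Result));
end;
```

This version does not null-terminate the array. If you need that, allocate n + 1 elements and set the last one to 0 yourself.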

If you have some other string type that's neither WideString, UnicodeString, nor UTF8String, then the way to convert it to UTF-8 is to first convert it to WideString or UnicodeString, and then convert that to UTF-8.
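
For an AnsiString, that two-step route could be sketched as follows. The function name is hypothetical, and the sketch assumes the input is encoded in the code page the RTL uses for Ansi-to-Unicode conversion:

```pascal
// Hypothetical helper: AnsiString -> WideString -> UTF-8.
function AnsiToUtf8Sketch(const a: AnsiString): UTF8String;
var
  ws: WideString;
begin
  // Step 1: decode the ANSI bytes into UTF-16.
  ws := WideString(a);
  // Step 2: encode the UTF-16 text as UTF-8.
  Result := Utf8Encode(ws);
end;
```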

Solution 4

var S: UTF8String;
    B: TBytes;

begin
  S := 'Șase sași în șase saci';
  SetLength(B, Length(S)); // Length(s) = 26 for this 22 char string.
  CopyMemory(@B[0], @S[1], Length(S));
end.

Depending on what you need the bytes for, you might want to include a NULL terminator.

For production code, make sure you test for an empty string, since S[1] and B[0] do not exist when S is empty. Adding the 3-4 LOC required would just make the sample harder to read.

Solution 5

I have the following two routines (source code can be downloaded here - http://www.csinnovations.com/framework_utilities.htm):

function CsiBytesToStr(const pInData: TByteDynArray; pStringEncoding: TECsiStringEncoding; pIncludesBom: Boolean): string;

function CsiStrToBytes(const pInStr: string; pStringEncoding: TECsiStringEncoding; pIncludeBom: Boolean): TByteDynArray;

Author by

Mariusz

Updated on June 05, 2022

Comments

  • Mariusz almost 2 years
    How to convert a WideString (or other long string) to byte array in UTF-8?
  • Cosmin Prund about 13 years
    The string is not empty. It contains the value 'Șase sași în șase saci'
  • Andreas Rejbrand about 13 years
    +1. Not everyone (to say the least!) knows how the Length function really works!
  • David Heffernan about 13 years
    @Cosmin I can see that the string is not empty. I just have a feeling that the OP may be interested in text other than 'Șase sași în șase saci'.
  • Andreas Rejbrand about 13 years
    @Cosmin, @David: Surely @Cosmin was joking! (Indeed, David's point is very good.)
  • Mariusz about 13 years
    I want to send the bytes to my Java app thru the sockets.
  • David Heffernan about 13 years
    @Cosmin No it will not. That's the thing about assertions!
  • Mariusz about 13 years
    one question.. what unit do I have to add to use StringElementSize()?(lazarus). Sorry for such questions, im a newbie
  • David Heffernan about 13 years
    @Mariusz What does your "lazarus" statement mean? You tagged the question Delphi. In Delphi it's in system.pas and so automatically used by all units.
  • Andreas Rejbrand about 13 years
    @Mariusz: You can remove the entire Assert... line. But since you tagged your question Delphi, and not free-pascal, @David's answer applies to Delphi, and not Free Pascal. But the code above might work in Free Pascal, too. I don't know. Try it.
  • Rob Kennedy about 13 years
    Note that if the input string is already encoded as UTF-8, GetBytes will be very wasteful. The compiler will convert the input string to UnicodeString since that's the only string argument GetBytes allows, and the GetBytes will convert the characters back to UTF-8 to generate its result.
  • Marco van de Voort about 13 years
    It is D2009+ specific code, and thus will not work on FPC which follows pre D2009 semantics. Passing a widestring (see original question) to a "UTF8string" will change it to the local encoding (NOT UTF-8 like in D2009+), and thus garble the string. FPC has special documented functions for this, see separate answer