String to byte array in UTF-8?
Solution 1
A function like this will do what you need:
function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s));
  if Length(Result)>0 then
    Move(s[1], Result[0], Length(s));
end;
You can call it with any type of string and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be tricked into thinking you must convert to UTF-8 before calling, just pass in any string and let the RTL do the work.
After that it's a fairly standard array copy. Note the assertion that explicitly calls out the assumption on string element size for a UTF-8 encoded string.
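As a rough usage sketch, assuming Delphi 2009 or later: the program below passes a WideString straight to the function and lets the compiler insert the conversion. The program name and the sample text are made up for illustration, and the function is repeated (without the assertion) only so the sketch compiles on its own. The compiler may emit an implicit string cast warning at the call site.

```pascal
program Utf8BytesUsage;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Same idea as the function above, repeated so this sketch is self-contained.
function UTF8Bytes(const s: UTF8String): TBytes;
begin
  SetLength(Result, Length(s));
  if Length(Result) > 0 then
    Move(s[1], Result[0], Length(s));
end;

var
  ws: WideString;
  bytes: TBytes;
begin
  // U+00EE followed by 'n', written with a character code so the result
  // does not depend on the encoding of the source file.
  ws := #$00EE + 'n';
  // No manual conversion: the compiler converts UTF-16 to UTF-8 when it
  // binds ws to the UTF8String parameter.
  bytes := UTF8Bytes(ws);
  // U+00EE takes two bytes in UTF-8 and 'n' takes one, so this should print 3.
  Writeln(Length(bytes));
end.
```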
If you want to get the zero-terminator you would write it so:
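function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s)+1);
  // Result is never empty here, so guard on the source string instead;
  // s[1] must not be touched when s is empty.
  if Length(s)>0 then
    Move(s[1], Result[0], Length(s));
  Result[high(Result)] := 0;
end;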
Solution 2
You can use TEncoding.UTF8.GetBytes, declared in SysUtils.pas.
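A minimal sketch of that call, assuming Delphi 2009 or later, where string is UnicodeString and TEncoding is available. The program name and sample text are illustrative only.

```pascal
program GetBytesSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  s: string;      // UnicodeString in Delphi 2009 and later
  bytes: TBytes;
  i: Integer;
begin
  // U+00EE followed by 'n', written with a character code so the result
  // does not depend on the encoding of the source file.
  s := #$00EE + 'n';
  // GetBytes returns the UTF-8 encoding of the string; no byte order mark
  // is added (the preamble is exposed separately through GetPreamble).
  bytes := TEncoding.UTF8.GetBytes(s);
  for i := 0 to High(bytes) do
    Write(IntToHex(bytes[i], 2), ' ');
  Writeln;
end.
```

If the text is already held in a UTF8String, this route converts it to UTF-16 and back again, so a plain byte copy is likely cheaper in that case.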
Solution 3
If you're using Delphi 2009 or later (the Unicode versions), converting a WideString to a UTF8String is a simple assignment statement:
var
  ws: WideString;
  u8s: UTF8String;
begin
  u8s := ws;
end;
The compiler will call the right library function to do the conversion because it knows that values of type UTF8String have a "code page" of CP_UTF8.
In Delphi 7 and later, you can use the provided library function Utf8Encode. For even earlier versions, you can get that function from other libraries, such as the JCL.
You can also write your own conversion function using the Windows API:
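function CustomUtf8Encode(const ws: WideString): UTF8String;
var
  n: Integer;
begin
  // WideCharToMultiByte fails when asked to convert zero characters,
  // so handle the empty string before calling it.
  if ws = '' then
  begin
    Result := '';
    Exit;
  end;
  // First call: ask how many bytes the UTF-8 output needs.
  n := WideCharToMultiByte(cp_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);
  SetLength(Result, n);
  // Second call: perform the conversion into the allocated buffer.
  n := WideCharToMultiByte(cp_UTF8, 0, PWideChar(ws), Length(ws), PAnsiChar(Result), n, nil, nil);
  Win32Check(n = Length(Result));
end;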
A lot of the time, you can simply use a UTF8String as an array, but if you really need a byte array, you can use David's and Cosmin's functions. If you're writing your own character-conversion function, you can skip the UTF8String and go directly to a byte array; just change the return type to TBytes or array of Byte. (You may also wish to increase the length by one, if you want the array to be null-terminated. SetLength will do that to the string implicitly, but not to an array.)
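One possible shape for the "go directly to a byte array" variant is sketched below. It assumes Windows and a Delphi version whose SysUtils declares TBytes; the function name Utf8EncodeToBytes and the sample text are hypothetical, not part of any library.

```pascal
program DirectToBytesSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: converts UTF-16 text straight into a byte array,
// without an intermediate UTF8String.
function Utf8EncodeToBytes(const ws: WideString): TBytes;
var
  n: Integer;
begin
  SetLength(Result, 0);
  if ws = '' then
    Exit; // WideCharToMultiByte rejects a zero-length input
  // First call: query the required number of bytes.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);
  SetLength(Result, n);
  // Second call: write the UTF-8 bytes into the array.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws),
    PAnsiChar(@Result[0]), n, nil, nil);
  Win32Check(n = Length(Result));
end;

var
  bytes: TBytes;
begin
  bytes := Utf8EncodeToBytes(WideString(#$00EE + 'n'));
  Writeln(Length(bytes));
end.
```

For a null-terminated array, one option is to allocate n + 1 bytes, pass n as the buffer size to the second call, and set the final element to 0 afterwards.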
If you have some other string type that's neither WideString, UnicodeString, nor UTF8String, then the way to convert it to UTF-8 is to first convert it to WideString or UnicodeString, and then convert that result to UTF-8.
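The two-step route might look like the following sketch for an AnsiString input. The function name AnsiToUtf8ViaWide is hypothetical, and the first step relies on the RTL's normal ANSI-to-UTF-16 conversion for whatever code page the input uses.

```pascal
program TwoStepSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: goes through UTF-16 on the way to UTF-8.
function AnsiToUtf8ViaWide(const a: AnsiString): UTF8String;
var
  ws: WideString;
begin
  ws := WideString(a);      // step 1: ANSI code page -> UTF-16
  Result := UTF8Encode(ws); // step 2: UTF-16 -> UTF-8
end;

var
  u8: UTF8String;
begin
  u8 := AnsiToUtf8ViaWide('abc');
  // Plain ASCII text encodes to the same number of bytes in UTF-8.
  Writeln(Length(u8));
end.
```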
Solution 4
var
  S: UTF8String;
  B: TBytes;
begin
  S := 'Șase sași în șase saci';
  SetLength(B, Length(S)); // Length(S) = 26 for this 22 char string.
  CopyMemory(@B[0], @S[1], Length(S));
end.
Depending on what you need the bytes for, you might want to include a NULL terminator.
For production code, make sure you test for an empty string. Adding the 3-4 lines of code required would just make the sample harder to read.
Solution 5
I have the following two routines (source code can be downloaded here - http://www.csinnovations.com/framework_utilities.htm):
function CsiBytesToStr(const pInData: TByteDynArray; pStringEncoding: TECsiStringEncoding; pIncludesBom: Boolean): string;
function CsiStrToBytes(const pInStr: string; pStringEncoding: TECsiStringEncoding; pIncludeBom: Boolean): TByteDynArray;
Mariusz
Updated on June 05, 2022
Comments
-
Mariusz almost 2 years
How to convert a WideString (or other long string) to byte array in UTF-8?
-
Cosmin Prund about 13 years
The string is not empty. It contains the value 'Șase sași în șase saci'
-
Andreas Rejbrand about 13 years
+1. Not everyone (to say the least!) knows how the Length function really works!
-
David Heffernan about 13 years
@Cosmin I can see that the string is not empty. I just have a feeling that the OP may be interested in text other than 'Șase sași în șase saci'.
-
Andreas Rejbrand about 13 years
@Cosmin, @David: Surely @Cosmin was joking! (Indeed, David's point is very good.)
-
Mariusz about 13 years
I want to send the bytes to my Java app through sockets.
-
David Heffernan about 13 years
@Cosmin No it will not. That's the thing about assertions!
-
Mariusz about 13 years
One question: what unit do I have to add to use StringElementSize()? (I'm using Lazarus.) Sorry for such questions, I'm a newbie.
-
David Heffernan about 13 years
@Mariusz What does your "lazarus" statement mean? You tagged the question Delphi. In Delphi it's in system.pas and so automatically used by all units.
-
Andreas Rejbrand about 13 years
@Mariusz: You can remove the entire Assert... line. But since you tagged your question Delphi, and not free-pascal, @David's answer applies to Delphi, and not Free Pascal. But the code above might work in Free Pascal, too. I don't know. Try it.
-
Rob Kennedy about 13 years
Note that if the input string is already encoded as UTF-8, GetBytes will be very wasteful. The compiler will convert the input string to UnicodeString since that's the only string argument GetBytes allows, and then GetBytes will convert the characters back to UTF-8 to generate its result.
-
Marco van de Voort about 13 years
It is D2009+ specific code, and thus will not work on FPC, which follows pre-D2009 semantics. Passing a WideString (see the original question) to a "UTF8String" will change it to the local encoding (NOT UTF-8 as in D2009+), and thus garble the string. FPC has special documented functions for this; see the separate answer.