Converting UnicodeString to AnsiString
Solution 1
In this particular case, using RawByteString is an appropriate solution:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): RawByteString;
var
  strLen: Integer;
begin
  Result := '';
  // First call: ask how many bytes the converted text needs
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    // Second call: convert into the allocated buffer
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    // Tag the string with its real code page without converting the bytes
    SetCodePage(Result, CodePage, False);
  end;
end;
This way, the RawByteString holds the codepage, and assigning the RawByteString to any other string type, whether that be AnsiString or UTF8String or whatever, will allow the RTL to automatically convert the RawByteString data from its current codepage to the destination string's codepage (which includes conversions to UnicodeString).
If you absolutely must return an AnsiString (which I do not recommend), you can still use SetCodePage() via a typecast:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
var
  strLen: Integer;
begin
  Result := '';
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    // SetCodePage() takes a RawByteString, hence the pointer typecast
    SetCodePage(PRawByteString(@Result)^, CodePage, False);
  end;
end;
The reverse is much easier: just use the codepage already stored in a (Ansi|RawByte)String (and make sure those codepages are always accurate), since the RTL already knows how to retrieve and use the codepage for you:
function StringToWideString(const Source: AnsiString): UnicodeString; overload;
begin
  Result := UnicodeString(Source);
end;
function StringToWideString(const Source: RawByteString): UnicodeString; overload;
begin
  Result := UnicodeString(Source);
end;
That being said, I would suggest dropping the helper functions altogether and just using typed strings instead. Let the RTL handle conversions for you:
type
Win1252String = type AnsiString(1252);
var
s: UnicodeString;
a: Win1252String;
begin
s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
a := Win1252String(s);
s := UnicodeString(a);
end;
var
s: UnicodeString;
u: UTF8String;
begin
s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
u := UTF8String(s);
s := UnicodeString(u);
end;
Solution 2
I think that returning a RawByteString is probably as good as you'll get. You could do it using AnsiString as you outlined but RawByteString captures the intent better. In this scenario a RawByteString morally counts as a parameter in the sense of the official Embarcadero advice. It is just an output rather than an input. The real key is not to use it as a variable.
You could code it like this:
function MBCSString(const s: UnicodeString; CodePage: Word): RawByteString;
var
enc: TEncoding;
bytes: TBytes;
begin
enc := TEncoding.GetEncoding(CodePage);
try
bytes := enc.GetBytes(s);
SetLength(Result, Length(bytes));
Move(Pointer(bytes)^, Pointer(Result)^, Length(bytes));
SetCodePage(Result, CodePage, False);
finally
enc.Free;
end;
end;
Then
var
s: AnsiString;
....
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1251);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 65001);
Writeln(StringCodePage(s));
outputs 1252, 1251, and then 65001 as you would expect.
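A comment on this answer notes that SetString() can replace the SetLength()/Move() pair. The following is a minimal sketch of that variant, not a tested drop-in replacement: the program name is hypothetical, and it assumes that casting the TBytes array to PAnsiChar is acceptable for this copy.

```delphi
program SetStringSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

function MBCSString(const s: UnicodeString; CodePage: Word): RawByteString;
var
  enc: TEncoding;
  bytes: TBytes;
begin
  enc := TEncoding.GetEncoding(CodePage);
  try
    bytes := enc.GetBytes(s);
    // SetString allocates Result and copies Length(bytes) bytes in one step
    SetString(Result, PAnsiChar(bytes), Length(bytes));
    // Tag the string with the code page; False means "do not convert the bytes"
    SetCodePage(Result, CodePage, False);
  finally
    enc.Free;
  end;
end;

var
  s: AnsiString;
begin
  s := MBCSString('abc', 1252);
  Writeln(Length(s));          // three ASCII characters give three bytes
  Writeln(StringCodePage(s));  // the code page tag carried by the string
end.
```

The temporary TBytes is still allocated, so this only tidies the copy; it does not remove the extra memory use.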
And you could use LocaleCharsFromUnicode if you prefer. Of course, you need to take its documentation with a pinch of salt: "LocaleCharsFromUnicode is a wrapper for the WideCharToMultiByte function." Amazing that text was ever written since LocaleCharsFromUnicode surely only exists to be cross-platform.
However, I wonder if you may be making a mistake in attempting to keep ANSI encoded text in AnsiString variables in your program. Normally you would encode to ANSI as late as possible (at the interop boundary), and likewise decode as early as possible.
If you simply have to do this then perhaps there is a better solution that avoids the dreaded AnsiString completely. Instead of storing the text in an AnsiString, store it in TBytes. You already have data structures that keep track of encoding, so why not keep them? Replace the record that contains code page and AnsiString with one containing code page and TBytes. Then you would have no fear of anything recoding your text behind your back. And your code will be ready to use on the mobile compilers.
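A record pairing a code page with TBytes could look something like the sketch below. The names TEncodedText, EncodeText and DecodeText are hypothetical and do not come from the original code. The sketch assumes TEncoding.GetEncoding is available (as used in MBCSString above). The test string is written with a #$00F4 escape for "ô" so the result does not depend on how the source file is saved.

```delphi
program EncodedTextSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical record: raw bytes plus the code page they are encoded in
  TEncodedText = record
    CodePage: Integer;
    Bytes: TBytes;
  end;

// Encode as late as possible, e.g. just before handing the bytes to an API
function EncodeText(const S: UnicodeString; CodePage: Integer): TEncodedText;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result.CodePage := CodePage;
    Result.Bytes := Enc.GetBytes(S);
  finally
    Enc.Free;
  end;
end;

// Decode as early as possible, using the code page stored with the bytes
function DecodeText(const T: TEncodedText): UnicodeString;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(T.CodePage);
  try
    Result := Enc.GetString(T.Bytes);
  finally
    Enc.Free;
  end;
end;

var
  T: TEncodedText;
begin
  T := EncodeText('f'#$00F4'x', 1252);         // "fôx"
  Writeln(T.CodePage);                         // 1252
  Writeln(Length(T.Bytes));                    // 3: one byte per character in Windows-1252
  Writeln(DecodeText(T) = 'f'#$00F4'x');       // TRUE: "ô" exists in Windows-1252
end.
```

Because TBytes carries no code page tag of its own, the RTL has nothing to convert implicitly; every encode and decode is an explicit call.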
Solution 3
Grovelling through System.pas, i found the built-in function SetAnsiString that does what i want:
procedure SetAnsiString(Dest: _PAnsiStr; Source: PWideChar; Length: Integer; CodePage: Word);
It's also important to note that this function does push the CodePage into the internal StrRec structure for me:
PStrRec(PByte(Dest) - SizeOf(StrRec)).codePage := CodePage;
This allows me to write something like:
function WideStringToString(const Source: UnicodeString; DestinationCodePage: Word): AnsiString;
var
  strLen: Integer;
begin
  strLen := Length(Source);
  if strLen = 0 then
  begin
    Result := '';
    Exit;
  end;
  //Delphi XE6 has a function to convert a unicode string to a tagged AnsiString
  SetAnsiString(@Result, @Source[1], strLen, DestinationCodePage);
end;
So when i call:
actual := WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 850);
i get the resulting AnsiString:
codePage: $0352 (850)
elemSize: $0001 (1)
refCnt: $00000001 (1)
length: $0000002C (44)
contents: 'The qùíçk brown fôx jumped ovêr the láZÿ dog'
An AnsiString with the appropriate code-page already stuffed in the secret codePage member.
The other way
class function TUnicodeHelper.ByteStringToUnicode(const Source: RawByteString; CodePage: UINT): UnicodeString;
var
wideLen: Integer;
dw: DWORD;
begin
{
See http://msdn.microsoft.com/en-us/library/dd317756.aspx
Code Page Identifiers
for a list of code pages supported in Windows.
Some common code pages are:
CP_UTF8 (65001) utf-8 "Unicode (UTF-8)"
CP_ACP (0) The system default Windows ANSI code page.
CP_OEMCP (1) The current system OEM code page.
1252 Windows-1252 "ANSI Latin 1; Western European (Windows)", this is what most of us in north america use in Windows
437 IBM437 "OEM United States", this is your "DOS fonts"
850 ibm850 "OEM Multilingual Latin 1; Western European (DOS)", the format accepted by Fincen for LCTR/STR
28591 iso-8859-1 "ISO 8859-1 Latin 1; Western European (ISO)", Windows-1252 is a super-set of iso-8859-1, adding things like euro symbol, bullet and ellipses
20127 us-ascii "US-ASCII (7-bit)"
}
if Length(Source) = 0 then
begin
Result := '';
Exit;
end;
// Determine the length of the resulting string, in WideChars
// wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
if wideLen = 0 then
begin
dw := GetLastError;
raise EConvertError.Create('[StringToWideString] Could not get wide length of UTF-16 string. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
end;
// Allocate memory for UTF-16 string
SetLength(Result, wideLen);
// Convert source string to UTF-16 (WideString)
// wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(wideStr), wideLen);
wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(Result), wideLen);
if wideLen = 0 then
begin
dw := GetLastError;
raise EConvertError.Create('[StringToWideString] Could not convert string to UTF-16. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
end;
end;
Note: Any code released into public domain. No attribution required.
mistertodd
Updated on January 15, 2020
Comments
-
mistertodd over 4 years
In the olden times, i had a function that would convert a WideString to an AnsiString of the specified code-page:
function WideStringToString(const Source: WideString; CodePage: UINT): AnsiString;
...
begin
  ...
  // Convert source UTF-16 string (WideString) to the destination using the code-page
  strLen := WideCharToMultiByte(CodePage, 0,
    PWideChar(Source), Length(Source), //Source
    PAnsiChar(cpStr), strLen, //Destination
    nil, nil);
  ...
end;
And everything worked. I passed the function a unicode string (i.e. UTF-16 encoded data) and converted it to an AnsiString, with the understanding that the bytes in the AnsiString represented characters from the specified code-page.
For example:
TUnicodeHelper.WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
would return the Windows-1252 encoded string:
The qùíçk brown fôx jumped ovêr the lázÿ dog
Note: Information was of course lost during the conversion from the full Unicode character set to the limited confines of the Windows-1252 code page:
- Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ (before)
- The qùíçk brown fôx jumped ovêr the lázÿ dog (after)
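The loss shown above can be checked with a short round trip. This is a minimal sketch, not the code from the question: it assumes the RTL's TEncoding class instead of calling WideCharToMultiByte directly, and the program and variable names are hypothetical. The character Ŧ is written as #$0166 so the result does not depend on how the source file is saved.

```delphi
program BestFitSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Enc: TEncoding;
  Original, RoundTrip: UnicodeString;
  Bytes: TBytes;
begin
  Original := #$0166'he';  // "Ŧhe"; U+0166 has no slot in Windows-1252
  Enc := TEncoding.GetEncoding(1252);
  try
    Bytes := Enc.GetBytes(Original);   // Ŧ is replaced by a substitute (typically T)
    RoundTrip := Enc.GetString(Bytes); // decoding cannot restore the original character
  finally
    Enc.Free;
  end;
  Writeln(Length(Bytes));          // 3: one byte per character
  Writeln(Original = RoundTrip);   // FALSE: information was lost on the way out
end.
```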
But the Windows WideCharToMultiByte does a pretty good job of best-fit mapping; as it is designed to do.
Now the after times
Now we are in the after times. WideString is now a pariah, with UnicodeString being the goodness. It's an inconsequential change; as the Windows function only needed a pointer to a series of WideChar anyway (which a UnicodeString also is). So we change the declaration to use UnicodeString instead:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
begin
  ...
end;
Now we come to the return value. i have an AnsiString that contains the bytes:
54 68 65 20 71 F9 ED E7   The qùíç
6B 20 62 72 6F 77 6E 20   k brown
66 F4 78 20 6A 75 6D 70   fôx jump
65 64 20 6F 76 EA 72 20   ed ovêr
74 68 65 20 6C E1 7A FF   the lázÿ
20 64 6F 67                dog
In the olden times that was fine. I kept track of what code-page the AnsiString actually contained; i had to remember that the returned AnsiString was not encoded using the computer's locale (e.g. Windows 1258), but instead is encoded using another code-page (the CodePage code page).
But in Delphi XE6 an AnsiString also secretly contains the codepage:
- codePage: 1258
- length: 44
- value: The qùíçk brown fôx jumped ovêr the lázÿ dog
This code-page is wrong. Delphi is specifying the code-page of my computer, rather than the code-page that the string is. Technically this is not a problem, i always understood that the AnsiString was in a particular code-page, i just had to be sure to pass that information along.
So when i wanted to decode the string, i had to pass along the code-page with it:
s := TUnicodeHelper.StringToWideString(s, 1252);
with
function StringToWideString(s: AnsiString; CodePage: UINT): UnicodeString;
begin
  ...
  MultiByteToWideChar(...);
  ...
end;
Then one person screws everything up
The problem was that in the olden times i declared a type called Utf8String:
type
  Utf8String = type AnsiString;
Because it was common enough to have:
function TUnicodeHelper.WideStringToUtf8(const s: UnicodeString): Utf8String;
begin
  Result := WideStringToString(s, CP_UTF8);
end;
and the reverse:
function TUnicodeHelper.Utf8ToWideString(const s: Utf8String): UnicodeString;
begin
  Result := StringToWideString(s, CP_UTF8);
end;
Now in XE6 i have a function that takes a Utf8String. If some existing code somewhere were to take a UTF-8 encoded AnsiString, and try to convert it to UnicodeString using Utf8ToWideString it would fail:
s: AnsiString;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
...
ws: UnicodeString;
ws := Utf8ToWideString(s); //Delphi will treat s as CP1252, and convert it to UTF8
Or worse, is the breadth of existing code that does:
s: Utf8String;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
The returned string will become totally mangled:
- the function returns AnsiString(1252) (AnsiString tagged as encoded using the current codepage)
- the return result is being stored in an AnsiString(65001) string (Utf8String)
- Delphi converts the UTF-8 encoded string into UTF-8 as though it was 1252.
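The mangling in the last bullet can be reproduced on raw bytes. The sketch below is an illustration under assumptions, not the question's code: it uses TEncoding instead of implicit AnsiString conversions, and the program and variable names are hypothetical. It encodes "fôx" as UTF-8, misreads those bytes as Windows-1252, and encodes the misread text as UTF-8 again.

```delphi
program DoubleEncodeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Cp1252: TEncoding;
  Utf8Bytes, Mangled: TBytes;
  Misread: UnicodeString;
begin
  // "fôx": the ô (U+00F4) takes two bytes in UTF-8 (C3 B4)
  Utf8Bytes := TEncoding.UTF8.GetBytes('f'#$00F4'x');

  Cp1252 := TEncoding.GetEncoding(1252);
  try
    // Wrong assumption: treat the UTF-8 bytes as Windows-1252 text.
    // Each of the two bytes C3 and B4 becomes its own character.
    Misread := Cp1252.GetString(Utf8Bytes);
  finally
    Cp1252.Free;
  end;

  // Re-encoding the misread text as UTF-8 bakes the damage in
  Mangled := TEncoding.UTF8.GetBytes(Misread);

  Writeln(Length(Utf8Bytes)); // 4
  Writeln(Length(Misread));   // 4 characters instead of 3
  Writeln(Length(Mangled));   // 6: each misread character now takes two bytes
end.
```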
How to move forward
Ideally my UnicodeStringToString(string, codePage) function (which returns an AnsiString) could set the CodePage inside the string to match the actual code-page using something like SetCodePage:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): AnsiString;
begin
  ...
  WideCharToMultiByte(...);
  ...
  //Adjust the codepage contained in the AnsiString to match reality
  //SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
  if Length(Result) > 0 then
    PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Except that manually mucking around with the internal structure of an AnsiString is horribly dangerous.
So what about returning RawByteString?
It has been said, over and over, by a lot of people who aren't me that RawByteString is meant to be the universal recipient; it wasn't meant to be used as a return parameter:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): RawByteString;
begin
  ...
  WideCharToMultiByte(...);
  ...
  //Adjust the codepage contained in the AnsiString to match reality
  SetCodePage(Result, CodePage, False); //SetCodePage only works on RawByteString
end;
This has the virtue of being able to use the supported and documented SetCodePage.
But if we're going to cross a line, and start returning RawByteString, surely Delphi already has a function that can convert a UnicodeString to a RawByteString string and vice versa:
function WideStringToString(const s: UnicodeString; CodePage: UINT): RawByteString;
begin
  Result := SysUtils.Something(s, CodePage);
end;
function StringToWideString(const s: RawByteString; CodePage: UINT): UnicodeString;
begin
  Result := SysUtils.SomethingElse(s, CodePage);
end;
But what is it?
Or what else should i do?
This was a long-winded set of background for a trivial question. The real question is, of course, what should i be doing instead? There is a lot of code out there that depends on the UnicodeStringToString and the reverse.
tl;dr:
I can convert a UnicodeString to UTF-8 by doing:
Utf8Encode('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
and i can convert a UnicodeString to the current code-page by using:
AnsiString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
But how do i convert a UnicodeString to an arbitrary (unspecified) code-page?
My feeling is that since everything really is an AnsiString:
Utf8String = AnsiString(65001);
RawByteString = AnsiString(65535);
i should bite the bullet, bust open the AnsiString structure, and poke the correct code-page into it:
function StringToAnsi(const s: UnicodeString; CodePage: UINT): AnsiString;
begin
  LocaleCharsFromUnicode(CodePage, ..., s, ...);
  ...
  if Length(Result) > 0 then
    PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Then the rest of the VCL will fall in line.
-
David Heffernan over 9 years
Regarding the final section, that cannot help when the code page is determined at runtime.
-
Remy Lebeau over 9 years
@DavidHeffernan: True, which is where RawByteString comes into play. It is best not to mess with the codepage associated with a plain AnsiString. Converting UnicodeString to a RawByteString with appropriate codepage, and storing Ansi data in a RawByteString with appropriate codepage before assigning it to UnicodeString, is the best option in this situation.
-
mistertodd over 9 years
I don't have my heart set on returning an AnsiString; i just have to be very careful to change all existing code everywhere (in every project, in every shared library, in every dll, etc). It's especially troublesome in code that has been built into binaries (e.g. dll/bpl).
-
mistertodd over 9 years
The use of AnsiString at the time was a convenience. It featured ease of declaration, ease of manipulation, copy-on-write, easy to return. (cf. parameters declared array of Byte, and return types declared TByteDynArray).
-
Remy Lebeau over 9 years
This is the type of situation where hinting directives, like deprecated, come in handy. Mark your affected functions/classes/variables as deprecated and let the compiler show you everywhere they are being used, and then you can adjust that code as needed.
-
David Heffernan over 9 years
Passing Delphi strings across DLL boundaries is always wrong. You do need to stop doing that. And if you aren't passing across DLL boundaries then the fact that the code resides in DLLs is neither here nor there.
-
mistertodd over 9 years
In this case i would have the overloads WideStringToStr(UnicodeString, UINT): AnsiString and WideStringToStr(UnicodeString, UINT): RawByteString. I'm pretty sure Delphi will refuse to recognize the difference between them. I believe there will be Implicit cast from 'RawByteString' to 'AnsiString' with potential data loss warnings.
-
Remy Lebeau over 9 years
You cannot overload on return type alone. And it does not make sense to return a codepaged AnsiString instead of RawByteString.
-
Remy Lebeau over 9 years
Although the TEncoding usage is a little cleaner than using LocaleCharsFromUnicode() (or WideCharToMultiByte()), it does require using twice as much memory (one for the TBytes, then again for the Result), even if temporarily. BTW, you can use SetString() instead of SetLength()/Move().
-
Remy Lebeau over 9 years
As for the LocaleCharsFromUnicode() documentation, it was probably written at a time when LocaleCharsFromUnicode() really was just a wrapper for WideCharToMultiByte() by itself before new platforms were added, and not updated accordingly.
-
David Heffernan over 9 years
@Remy If LocaleCharsFromUnicode were added when Win32 was the only compiler then why would it have been added? Surely it exists to support cross platform. Even if it were added when bcc32 was the only compiler, surely they could have seen into the future. Anyway, I'm speculating. Not the most productive task.
-
Remy Lebeau over 9 years
@DavidHeffernan: Yes, they were added for cross-platform, per comments in the System unit:
"LocaleCharsFromUnicode is a cross-platform wrapper for WideCharToMultiByte with an emulated implemention on non-Windows platforms."
and
"UnicodeFromLocaleChars is a cross-platform wrapper for MultiByteToWideChar with an emulated implemention on non-Windows platforms."
The two functions were added in XE, but FireMonkey and OSX support were introduced in XE2.
-
SpaghettiCook over 9 years
@David-Heffernan, aren't there situations where dll's expect buffers filled with ansistrings? You don't literally pass delphi strings across the dll boundary but you do 'communicate' the string via a buffer and the dll expects a specific encoding? I may be wrong but it seems like a valid argument.
-
David Heffernan over 9 years
@SpaghettiCook I don't get your point. What I am saying is not to use a Delphi native string type as a parameter in an exported function. Do you disagree with that?
-
SpaghettiCook over 9 years
@David-Heffernan, of course I agree; no native strings across dll boundaries. I am missing your point that the dll argument above is neither here nor there. The "neither here nor there" is what I do not understand. If you need to pass a buffer with ansichars you probably want to convert your unicode to ansistring just like the question states. It seems you may take a different approach. What that would be is unclear.
-
mistertodd over 9 years
I came across the built-in function SetAnsiString which seems to solve a lot of my pain for me (in an officially supported and documented way).
-
Hugie over 8 years
What is the best/correct way to do this the other way around? Convert an AnsiString, e.g. from a file with a different codepage than the OS, to a Unicode String? Example: System CP is 1252, the file's codepage is 1250 (described in some meta-header-info)
-
Hugie over 8 years
Is it sufficient to do "SetCodePage(MyCP1250AnsiString, 1250, false)" on the AnsiString and then just do MyUnicodeString := MyCP1250AnsiString to convert it to a unicode string? Or am i missing something?
-
mistertodd over 8 years
@Hugie For that case i use the Windows function MultiByteToWideChar. You give it the bytes of an AnsiString and the code-page that it is currently in, and Windows can convert it to UTF-16 for you.
-
Hugie over 8 years
This seems to be similar to the approach through the Delphi API which i finally found:
UniStr := String( TEncoding.GetEncoding(SourceCP).GetChars( SourceBytes ) ) ;
Thx for your extra work.