Storing a string as UTF8 in C#

Solution 1

As you've found, the CLR uses UTF-16 for character encoding. Your best bet may be to use the Encoding classes and a BitConverter to handle the text. This question has some good examples of converting between the two encodings:

Convert String (UTF-16) to UTF-8 in C#
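
As a rough sketch of what that conversion looks like with the Encoding class, the following compares the size of the same text in both encodings and re-encodes a UTF-16 buffer as UTF-8. The names `EncodingSizes`, `Utf16Bytes`, `Utf8Bytes` and `Utf16ToUtf8` are made up for illustration; only the `System.Text.Encoding` calls are real API.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingSizes
{
    // Bytes needed for the text in UTF-16 (the CLR's in-memory encoding).
    public static int Utf16Bytes(string text) => Encoding.Unicode.GetByteCount(text);

    // Bytes needed for the same text in UTF-8.
    public static int Utf8Bytes(string text) => Encoding.UTF8.GetByteCount(text);

    // Re-encodes a UTF-16 (little-endian) byte buffer as UTF-8.
    public static byte[] Utf16ToUtf8(byte[] utf16) =>
        Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
}
```

For ASCII-only text, `Utf8Bytes` should come out at half of `Utf16Bytes`, which is the saving the question is after.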

Solution 2

Well, you could create a wrapper that retrieves the data as UTF-8 bytes and converts pieces as needed to System.String, then vice-versa to push the string back out to memory. The Encoding class will help you out here:

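    // Requires: using System.Text;
    var utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(myString);        // compact UTF-8 bytes to keep in memory

    var myReturnedString = utf8.GetString(utfBytes);  // temporary UTF-16 string for processing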
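
A minimal sketch of such a wrapper might look like the following. It assumes the data really is ASCII-only, as stated in the question, and `AsciiTextStore` and `Slice` are hypothetical names, not framework types.

```csharp
using System;
using System.Text;

// Hypothetical wrapper: keeps the full text as ASCII bytes and only
// materializes a UTF-16 System.String for the slice being worked on.
public sealed class AsciiTextStore
{
    private readonly byte[] _bytes;

    public AsciiTextStore(string text)
    {
        // Assumption: the input is ASCII-only. Encoding.ASCII replaces
        // any character above 0x7F with '?'.
        _bytes = Encoding.ASCII.GetBytes(text);
    }

    // One byte per character, so the character count is the array length.
    public int Length => _bytes.Length;

    // Decodes only the requested range into a temporary System.String.
    public string Slice(int start, int count) =>
        Encoding.ASCII.GetString(_bytes, start, count);
}
```

The bulk of the data stays at one byte per character, and only the slices currently being processed pay the UTF-16 cost.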

Solution 3

Not really. System.String is designed for storing strings. Your requirement is for a very particular subset of strings with particular memory benefits.

Now, "very particular subset of strings with particular memory benefits" comes up a lot, but not always the same very particular subset. Code that is ASCII-only isn't for reading by human beings, so it tends to be either short codes, or something that can be handled in a stream-processing manner, or else chunks of text merged in with bytes doing other jobs (e.g. quite a few binary formats will have small bits that translate directly to ASCII).

As such, you've a pretty strange requirement.

All the more so when you come to the gigabytes part. If I'm dealing with gigs, I'm immediately thinking about how I can stop having to deal with gigs, and/or get much more serious savings than just 50%. I'd be thinking about mapping chunks I'm not currently interested in to a file, or about ropes, or about a bunch of other things. Of course, those are going to work for some cases and not for all, so yet again, we're not talking about something where .NET should stick in something as a one-size-fits-all, because one size will not fit all.

Beyond that, just the UTF-8 bit isn't that hard. It's all the other methods that become work. Again, what you need there won't be the same as what someone else needs.
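
To give a feel for what "all the other methods" means, here is one such method sketched directly over a byte array: a naive substring search that never creates a System.String. `AsciiBytes.IndexOf` is a hypothetical helper, and a real class would need many more of these (comparison, splitting, trimming and so on), each tuned to the data at hand.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class AsciiBytes
{
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Naive O(n*m) scan over raw bytes; no UTF-16 string is created.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        for (int i = 0; i <= text.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && text[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

Because the data is ASCII-only, byte offsets and character offsets coincide. That would not hold for general UTF-8 text.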

Solution 4

As I see it, your problem is that a char in C# occupies 2 bytes instead of one.

One way to read a text file is to open it with:

    byte[] buffer;
    using (var fs = new System.IO.FileStream(file, System.IO.FileMode.Open, System.IO.FileAccess.Read))
    using (var br = new System.IO.BinaryReader(fs))
    {
        // Read the whole file as raw bytes; assumes the file is
        // smaller than 2 GB so its length fits in an int.
        buffer = br.ReadBytes((int)fs.Length);
    }

This way you are reading the raw bytes from the file. I tried it with *.txt files that took 2 bytes per char (which matches UTF-16; ASCII-only text encoded as UTF-8 takes 1 byte per char) and with ANSI files, which take 1 byte per char.

Author by

LondonPhantom

Updated on August 28, 2020

Comments

  • LondonPhantom
    LondonPhantom over 3 years

    I'm doing a lot of string manipulation in C#, and really need the strings to be stored one byte per character. This is because I need gigabytes of text simultaneously in memory and it's causing low memory issues. I know for certain that this text will never contain non-ASCII characters, so for my purposes, the fact that System.String and System.Char store everything as two bytes per character is both unnecessary and a real problem.

    I'm about to start coding my own CharAscii and StringAscii classes - the string one will basically hold its data as byte[], and expose string manipulation methods similar to the ones that System.String does. However this seems a lot of work to do something that seems like a very standard problem, so I'm really posting here to check that there isn't already an easier solution. Is there for example some way I can make System.String internally store data as UTF8 that I haven't noticed, or some other way round the problem?

  • tmesser
    tmesser over 11 years
    +1, I investigated this problem myself when harvesting mass amounts of data for a real estate company and this solution, while a bit magical and seemingly janky, is pretty much the best thing I was able to come up with in C#.
  • Yinda Yin
    Yinda Yin over 11 years
    It's not so strange. The OP wants strings that work the same way as System.String, but take up half the space. A drop-in replacement, in other words.
  • Tigran
    Tigran over 11 years
    This ends up as a UTF-16 encoded string object, by the way.
  • Jon Hanna
    Jon Hanna over 11 years
    @RobertHarvey Yes, but they e.g. won't want a O(n) length because they know they don't need it from the knowledge of their data. Someone with similar but not identical needs for a utf-8 based string will need a O(n) count because they aren't sticking to ASCII-only. The general problem comes up a lot, but the tiny details vary and that makes one guy's perfect drop-in replacement another guy's poison.
  • tmesser
    tmesser over 11 years
    @Tigran, there is no way to get around that if you are going to use System.String at any point. You can, however, pull out subsections of the encoded byte array and write them out in a controlled way, leaving an upper limit on how many resources you're sucking up.
  • paparazzo
    paparazzo over 11 years
    @Tigran please elaborate. utf8 is not really utf8?
  • Tigran
    Tigran over 11 years
    @YYY: the actual point of this question is how to get around that problem. I think it's clear to the OP that it's possible to do, but this doesn't resolve the memory space problem.
  • KeithS
    KeithS over 11 years
    @Tigran - Yes it does. Unless the OP wants to completely forgo everything about Strings that you get for free with the .NET Framework (which I strongly recommend against), at least some of the data he's working with will have to be converted to and from a UTF-16 System.String to work with it. But the untold gigs of data he's working with overall can remain in UTF-8 (or even ASCII if he really is certain the data will not contain any non-ASCII characters).
  • Tigran
    Tigran over 11 years
    @Blam: UTF8 is for converting from a byte array encoded with that encoding, but it ends up as the same standard CLR string.
  • Jon Hanna
    Jon Hanna over 11 years
    Well, at least it only does so for one string at a time, and they can do some work with just the byte arrays all in memory. It gets them somewhere.
  • Tigran
    Tigran over 11 years
    @KeithS: I'm afraid that it's not a solution, because the OP explicitly states the wish to start writing a custom SingleByteString class.
  • LondonPhantom
    LondonPhantom over 11 years
    Robert Harvey has it exactly. Jon - What I'm doing, very roughly, involves extensive cross-referencing between bits of text. As such, it would be very hard to avoid having the entire text in memory during the processing. Writing chunks I'm not immediately interested in to a file, only to have to read them back a millisecond later, would I imagine be dreadful for performance! (As well as making the code more complicated.)
  • Jon Hanna
    Jon Hanna over 11 years
    I get you, where I disagree is in saying what hurts your case would help another and vice-versa, hence the lack of one-size-fits-all. Another possibility is that mutability would be a big help to you (if you do a bunch of same-size replacements) and a hurt to someone else (they can no longer get the memory boost of safely aliasing "different" strings that are actually the same). Or vice-versa. Or that factor is irrelevant to you. System.String is designed to be efficient over many cases. Once that's not good enough you need to think about your case, not hope for a general-purpose solution.
  • KeithS
    KeithS over 11 years
    ... and if he does he'll end up either having to rewrite every string manipulation function he currently has for free, and a host of other methods in his codebase and in the Framework that take System.String, or he'll see the better part of valor and set it up the way I suggested, converting small pieces to strings and back. This solves the memory space problem by allowing him to keep the overwhelming majority of his data in UTF-8.
  • LondonPhantom
    LondonPhantom over 11 years
    That's definitely another way of doing it. I'd be a little worried about the performance hit though if every bit of processing I do ends up embedded in a round-trip UTF16 conversion. (Of course it may be that that's partially made up for by System.String itself being more efficient at internal operations than I could hope to achieve in a custom StringAscii class).
  • Tigran
    Tigran over 11 years
    @PhantomDrummer: the basic answer is that what you would like to do is too complicated, so it's better to focus on the solutions provided here, or use persistent stores (say SQLite); in short, an architectural solution based on already invented and, more importantly, tested and workable stuff.
  • KeithS
    KeithS over 11 years
    @PhantomDrummer - UTF encoding conversion is actually pretty cheap, especially when converting char values 0-127. The only change between UTF-8 and UTF-16 for ordinals 0x0000-0x007F is to append or prepend a 0 byte (depending on endianness of the UTF-16 variant). This is almost always "good enough" for .NET programmers; if you want better you probably don't want .NET, as was suggested by a comment to your OP. Getting rid of the CLR's overhead is a bigger performance boost than avoiding the conversion in the first place.
  • LondonPhantom
    LondonPhantom over 11 years
    Do you mean UTF-16? UTF-8 will, like ANSI, be 1 byte per char for the particular data I'm asking about. But thanks, that is in fact exactly the way I'll be reading the data.
  • Thanatos
    Thanatos over 11 years
    @PhantomDrummer I actually tried UTF-8, Notepad's usual encoding, and it took 2 bytes per char :) Glad to help.
  • LondonPhantom
    LondonPhantom over 11 years
    Yes that's a fair point. Going down that reasoning I would need to write something for my specific problem, which in turn implies writing my own classes is the correct approach.
  • Jon Hanna
    Jon Hanna over 11 years
    Yep. Now, if I were you, I'd still do a hunt for someone having written a nice open-source utf8string class that matched what I needed, because sometimes we do get lucky. Even then though, I'd expect there to be some point where their clever memory-saver had to be cut out as a disaster to me, or where I could do a memory-saving trick that would have ruined them.
  • Jon Hanna
    Jon Hanna over 11 years
    Speaking of memory-saving tricks, do you know about ropes as in en.wikipedia.org/wiki/Rope_%28computer_science%29 as yet another example of something that is great for some people who have to deal with very large strings, and absolutely useless to others. Thought I'd mention it in case you were in the former camp :)
  • LondonPhantom
    LondonPhantom over 11 years
    Thanks. Marked this as the answer since the link contains lots of info about doing the conversion. I think the approach that you and KeithS suggest is probably the best compromise in my situation between maximum performance and getting some kind of solution that saves memory without taking too long to implement.