Writing utf16 to file in binary mode

10,219

Solution 1

I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.

Solution 2

Here we run into the little used locale properties. If you output your string as a string (rather than raw data) you can get the locale to do the appropriate conversion auto-magically.

N.B.This code does not take into account edianness of the wchar_t character.

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // construct a custom unicode facet and add it to a local.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB If you open the file stream first. Any attempt to imbue it with a local will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}    

The File: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output array/
       for(;(from < from_end) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object
            * Initialise it so that zero out all its bit's. This
            * is important on systems with 32bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       return((from > from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       for(;(from < from_end) && (to < to_limit);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       return((to > to_limit)?partial:ok);
   }
};

Solution 3

It is easy if you use the C++11 standard (because there are a lot of additional includes like "utf8" which solves this problems forever).

But if you want to use multi-platform code with older standards, you can use this method to write with streams:

  1. Read the article about UTF converter for streams
  2. Add stxutif.h to your project from sources above
  3. Open the file in ANSI mode and add the BOM to the start of a file, like this:

    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    
    unsigned char smarker[3];
    smarker[0] = 0xEF;
    smarker[1] = 0xBB;
    smarker[2] = 0xBF;
    
    fs << smarker;
    fs.close();
    
  4. Then open the file as UTF and write your content there:

    std::wofstream fs;
    fs.open(filepath, std::ios::out|std::ios::app);
    
    std::locale utf8_locale(std::locale(), new utf8cvt<false>);
    fs.imbue(utf8_locale); 
    
    fs << .. // Write anything you want...
    

Solution 4

The provided Utf16Facet didn't work in gcc for big strings, here is the version that worked for me... This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0]

#include <locale>
#include <bits/codecvt.h>


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

       for(;from < from_end;from += 2,++to)
       {
           if(to<=to_limit){
               (*to)                               = L'\0';

               reinterpret_cast<char*>(to)[0]  = from[0];
               reinterpret_cast<char*>(to)[1]  = from[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

       for(;(from < from_end);++from, to += 2)
       {
           if(to <= to_limit){

               to[0]     = reinterpret_cast<const char*>(from)[0];
               to[1]     = reinterpret_cast<const char*>(from)[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }
};
Share:
10,219
Cactuar
Author by

Cactuar

Updated on July 11, 2022

Comments

  • Cactuar
    Cactuar almost 2 years

    I'm trying to write a wstring to file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

    ofstream outFile("test.txt", std::ios::out | std::ios::binary);
    wstring hello = L"hello";
    outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
    outFile.close();
    

    Opening test.txt in for example Firefox with encoding set to UTF16 it will show as:

    h�e�l�l�o�

    Could anyone tell me why this happens?

    EDIT:

    Opening the file in a hex editor I get:

    FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00 
    

    Looks like I get two extra bytes in between every character for some reason?

  • Cactuar
    Cactuar over 15 years
    Yeah, you are correct wchar_t is of size 4, I'm in a mac. So that explains a lot :) I'm aware of the surrogate pairs in UTF-16, will have to look into that a bit more.
  • Martin York
    Martin York over 15 years
    From the output you can not tell it it is UTF-16 or UTF-32 all it shows is that wchar_t is 4 bytes wide. The encoding of the string is not defined by the language (though it is most likely to be UCS-4).
  • Jakob Riedle
    Jakob Riedle about 7 years
    Note, that your Facet implements the conversion to/from UCS-2, not UTF-16. UTF-16 is a variable length encoding that instruments so called surrogate pairs. UCS-2 is a subset of Unicode, which is the reason UTF-16 has been invented.