How do I remove the character "ï»¿" from the beginning of a text file in C++?

c++ byte

17,321

Solution 1

That's the UTF-8 BOM (byte order mark).

You need to read the file as UTF-8. If you don't need Unicode and only use the 128 ASCII code points (0 to 127), save the file as ASCII or as UTF-8 without a BOM.

Solution 2

This is the byte order mark (BOM). "ï»¿" is how the three bytes of the UTF-8 BOM (EF BB BF) appear when they are displayed as ISO-8859-1. You have to tell your editor not to write a BOM, or use a different editor to strip it out.

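In C++, you can use the following function to convert a UTF-8 file with a BOM to ANSI. The helper change_encoding_from_UTF8_to_ANSI is not shown in this answer; its declaration below is an assumption based on how it is called.

#include <fstream>
#include <string>
#include <vector>
using namespace std;

// Not shown in the original answer. Assumed to take a NUL-terminated
// UTF-8 string and return a NUL-terminated ANSI string.
char* change_encoding_from_UTF8_to_ANSI(char* text);

void change_encoding_from_UTF8BOM_to_ANSI(const char* filename)
{
    ifstream infile(filename);
    if (!infile)
        return; // could not open: leave the file untouched

    string strLine;
    string strResult;
    bool firstLine = true;
    // Loop on getline itself; testing eof() first appends a spurious empty line.
    while (getline(infile, strLine))
    {
        // The first 3 bytes (EF BB BF) are the UTF-8 BOM.
        // Strip them only if they are actually present.
        if (firstLine && strLine.compare(0, 3, "\xEF\xBB\xBF") == 0)
            strLine.erase(0, 3);
        firstLine = false;
        strResult += strLine + "\n";
    }
    infile.close();

    // Copy into a writable buffer of length() + 1 bytes so the
    // terminating NUL fits (the original allocated one byte too few).
    vector<char> changeTemp(strResult.c_str(), strResult.c_str() + strResult.length() + 1);
    char* changeResult = change_encoding_from_UTF8_to_ANSI(changeTemp.data());
    strResult = changeResult;

    ofstream outfile(filename);
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
}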

Author by

Hoang Minh

Updated on June 26, 2022

Comments

Hoang Minh about 2 years

I'm trying to read a text file and put each word into a node of a binary search tree. However, the first word is always read with "ï»¿" in front of it. For example, if my first word is "This", then the first word inserted into my tree is "ï»¿This". I've been searching the forum for a solution; there was one post asking about the same problem in Java, but no one has addressed it in C++. Would anyone help me fix it? Thank you.

I came to a simple solution. I opened the file in Notepad and saved it as ANSI. After that, the file is read and passed correctly into the binary search tree.
Jonathan Leffler over 10 years

It is strictly the UTF-8 encoding of U+FEFF, the BOM (also a zero-width no-break space, ZWNBSP), presented using the code set ISO 8859-1. UTF-8 does not need a BOM, of course.
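
The arithmetic behind this comment can be checked with a short sketch. `encode_utf8_3byte` is a hypothetical helper written for illustration; it assumes its argument is a non-surrogate code point in the range U+0800 to U+FFFF, which UTF-8 encodes as the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a code point in U+0800..U+FFFF as 3 UTF-8 bytes.
// No range or surrogate checking is done; this is only a demonstration.
std::string encode_utf8_3byte(std::uint32_t cp)
{
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));          // top 4 bits
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // middle 6 bits
    out += static_cast<char>(0x80 | (cp & 0x3F));         // low 6 bits
    return out;
}
```

For U+FEFF this yields the bytes EF BB BF. In ISO 8859-1 those byte values are the characters ï (0xEF), » (0xBB), and ¿ (0xBF), which is why the BOM shows up as "ï»¿" when the file is read as single-byte text.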