C++ reading a file in binary mode. Problems with END OF FILE


Solution 1

I eventually figured this out. The problem wasn't in the code at all; it was gedit, which always appends a newline character at the end of the file. Other editors, such as vim, do the same. Some editors can be configured not to append anything, but in gedit this apparently isn't possible. See https://askubuntu.com/questions/13317/how-to-stop-gedit-gvim-vim-nano-from-adding-end-of-file-newline-char
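One way to confirm that an editor added a trailing newline is to look at the last byte of the file directly. The following is only a minimal sketch; ends_with_newline is a hypothetical helper name that does not appear in the original post, and it assumes the file is small enough to seek in normally.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (not from the original post): returns true if the file
// can be opened, is non-empty, and its last byte is '\n'.
bool ends_with_newline(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;                   // could not open the file

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();  // file size in bytes, or -1 on failure
    if (size <= 0) return false;             // empty file or tellg() failed

    in.seekg(-1, std::ios::end);             // position on the last byte
    char last = 0;
    return in.get(last) && last == '\n';
}
```

Calling ends_with_newline("test.txt") on a file saved from gedit would be expected to return true, which would account for a final 10 in the output.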

Cheers to everyone who answered me,

Marco

Solution 2

fstream::get() returns an int, not a char. This is one of the problems.

Secondly, you are reading in binary, so you shouldn't use formatted streams. You should use fstream::read():

// read a file into memory
#include <iostream>     // std::cout
#include <fstream>      // std::ifstream

int main () {

  std::ifstream is ("test.txt", std::ifstream::binary);
  if (is) {
    // get length of file:
    is.seekg (0, is.end);
    std::streamsize length = is.tellg();  // an int could truncate the size of a large file
    is.seekg (0, is.beg);

    char * buffer = new char [length];

    std::cout << "Reading " << length << " characters... ";
    // read data as a block:
    is.read (buffer,length);

    if (is)
      std::cout << "all characters read successfully.";
    else
      std::cout << "error: only " << is.gcount() << " could be read";
    is.close();

    // ...buffer contains the entire file...

    delete[] buffer;
  }
  return 0;
}
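A variant of the same block read that avoids manual new/delete might look like the sketch below. The helper name read_all is hypothetical, and returning an empty vector on any failure is a simplifying assumption rather than something the answer prescribes.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file as raw bytes.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<char> read_all(const std::string& path) {
    std::ifstream is(path, std::ios::binary);
    if (!is) return {};

    // Determine the length of the file.
    is.seekg(0, std::ios::end);
    const std::streamoff length = is.tellg();
    if (length <= 0) return {};              // empty file or tellg() failed
    is.seekg(0, std::ios::beg);

    // Read the data as one block into a buffer that frees itself.
    std::vector<char> buffer(static_cast<std::size_t>(length));
    is.read(buffer.data(), static_cast<std::streamsize>(length));

    // Keep only the bytes that were actually read.
    buffer.resize(static_cast<std::size_t>(is.gcount()));
    return buffer;
}
```

Because the buffer is a std::vector, it is released automatically on every return path, including early exits.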

Solution 3

This isn't the way istream::get() was designed to be used. The classical idiom for using this function would be:

for ( int val = in.get(); val != EOF; val = in.get() ) {
    //  ...
}

or even more idiomatic:

char ch;
while ( in.get( ch ) ) {
    //  ...
}

The first loop is really inherited from C, where in.get() is the equivalent of fgetc().

Still, as far as I can tell, the code you give should work. It's not idiomatic, and it's not

The C++ standard is unclear what it should return if the character value read is negative. fgetc() requires a value in the range [0...UCHAR_MAX], and I think it safe to assume that this is the intent here. It is, at least, what every implementation I've used does. But this doesn't affect your input. Depending on how the implementation interprets the standard, the return value of in.get() must be in the range [0...UCHAR_MAX] or [CHAR_MIN...CHAR_MAX], or it must be EOF (typically -1). (The reason I'm fairly sure that the intent is to require [0...UCHAR_MAX] is because otherwise, you may not be able to distinguish end of file from a valid character.)

And if the return value is EOF (almost always -1), failbit should be set, so in.good() would return false. There is no case where in.get() would be allowed to return 221497852. The only explanation I can possibly think of for your results is that your file has some character with bit 7 set at the end of the file, that the implementation is returning a negative number for this (but not end of file, because it is a character), which results in an out of bounds index in values[val], and that this out of bounds index somehow ends up modifying val. Or that your implementation is broken, and is not setting failbit when it returns end of file.

To be certain, I'd be interested in knowing what you get from the following:

std::ifstream in( "text.txt", std::ios_base::binary );
int ch = in.get();
while ( ch != std::istream::traits_type::eof() ) {
    std::cout << ch << std::endl;
    ch = in.get();
}

This avoids any issues of a possibly invalid index, and any type conversions (although the conversion int to unsigned is well defined). Also, out of curiosity (since I can only access VC++ here), you might try replacing in as follows:

std::istringstream in( "\n\xE5" );

I would expect to get:

10
233

(Assuming 8 bit bytes and an ASCII based code set. Both of which are almost, but not quite universal today.)

Author: smellyarmpits


Updated on June 05, 2022

Comments

  • smellyarmpits
    smellyarmpits almost 2 years

    I am learning C++ and I have to read a file in binary mode. Here's how I do it (following the C++ reference):

    unsigned values[255];
    unsigned total;
    ifstream in ("test.txt", ifstream::binary);
    
    while(in.good()){
        unsigned val = in.get();
        if(in.good()){
            values[val]++;
            total++;
            cout << val <<endl;
        }
    }
    
    in.close();
    

    So, I am reading the file byte by byte for as long as in.good() is true. I put a cout at the end of the loop in order to understand what's happening, and here is the output:

    marco@iceland:~/workspace/huffman$ ./main 
    97
    97
    97
    97
    10
    98
    98
    10
    99
    99
    99
    99
    10
    100
    100
    10
    101
    101
    10
    221497852
    marco@iceland:~/workspace/huffman$
    

    Now, the input file "test.txt" is just:

    aaaa
    bb
    cccc
    dd
    ee
    

    So everything works perfectly till the end, where there's that 221497852. I guess it's something about the end of file, but I can't figure the problem out.

    I am using gedit and g++ on a 64-bit Debian machine. Any help will be appreciated.

    Many thanks,

    Marco

  • hmjd
    hmjd almost 11 years
    get() is unformatted, according to en.cppreference.com/w/cpp/io/basic_istream/get ?
  • smellyarmpits
    smellyarmpits almost 11 years
    aren't chars actually integers? What is the problem if I'm using get and assigning its return values to an unsigned int variable?
  • bash.d
    bash.d almost 11 years
    Well, it's not that easy. char can either be signed or unsigned by default, and if you use unsigned you cannot obtain such values as -1 which might indicate eof (system-dependent)
  • James Kanze
    James Kanze almost 11 years
    This answer is simply incorrect. The OP's code is not idiomatic, and doesn't use istream::get() the way I think it was meant to be used, but I don't see anything that would make it not work, unless istream::get() returned a negative value when it hadn't encountered end of file.
  • James Kanze
    James Kanze almost 11 years
    @bash.d But he's not using the return value to test for end of file. It's not idiomatic, but the standard does say that if there is no character to extract, istream::get() should set failbit. (I would also expect it to set eofbit, under the more general definition of eofbit.)
  • bash.d
    bash.d almost 11 years
    Which answer is simply incorrect?? This is what I stated, not to use fstream::get like this.
  • smellyarmpits
    smellyarmpits almost 11 years
    I've tried your code above and the resulting output is: none :D By the way, I tried changing my code, substituting endl with '\n'. Now the output is almost the same, except for the last number, which is now 10 (the ASCII '\n' character). So I now guess it's something about endl. Anyway, I still can't understand what that last 10 character is, since the file has no \n at the end.
  • smellyarmpits
    smellyarmpits almost 11 years
    Actually I think I am using the return value to check for eof. In fact, right after reading with get(), I coded if(in.good()) which, as the reference says, "Returns true if none of the stream's error state flags (eofbit, failbit and badbit) is set." cplusplus.com/reference/ios/ios/good
  • bash.d
    bash.d almost 11 years
    You should check for eofbit explicitly!
  • James Kanze
    James Kanze almost 11 years
    @MarcoGalassi There's something else going on that you're not showing us. The code above was copy/pasted from a program which compiles and runs on my machine, and works; I've done almost exactly the same thing in the past with g++. And whether you use '\n' or std::endl only affects whether you flush the output immediately.
  • James Kanze
    James Kanze almost 11 years
    @bash.d You should almost never check for eofbit. Only after input has failed, to know whether it failed because you'd reached end of file, or whether it failed because of a format error.
  • James Kanze
    James Kanze almost 11 years
    @bash.d The claim that istream::get() returning an int is part of the problem, and the claim that he should use istream::read() if the file is binary. Your example code isn't a very good example either.
  • James Kanze
    James Kanze almost 11 years
    @MarcoGalassi You're not using the return value of istream::get() to test for end of file. You're using the internal state of the istream. Both should work, but when using istream::get() (rather than istream::get( char& )), assigning the return value to an int, then comparing that with EOF, is the idiomatic way of doing things.
  • smellyarmpits
    smellyarmpits almost 11 years
    There is nothing more. That's the code. I'm compiling using g++ main.cpp -o main and running the program with ./main. I think there is some kind of issue at the end of the file. I also read that endl is supposed to be a \n with flushing, and that's why I can't understand why the output changes. What machine are you using? Linux, Windows? What architecture? (I'm trying to check all the differences.)
  • James Kanze
    James Kanze almost 11 years
    @MarcoGalassi The machine here, on which I just checked, is a Windows machine, but I've done exactly the same thing on Linux boxes and Sun Sparcs (under Solaris) in the past. If the third block of code I posted doesn't work, there's something wrong with your installation. (I've just run it under Windows, after carefully creating a file with Unix line endings and no last eol, and I get the output you got through the second 101, then nothing.)
  • smellyarmpits
    smellyarmpits almost 11 years
    When you talk about "something wrong with your installation", what are you referring to? g++? The Debian installation? Or something else?
  • James Kanze
    James Kanze almost 11 years
    @MarcoGalassi I don't know. Probably g++. (It could be something like the headers you compile against being from a different version of the library than what you link against.) If I can remember and find time, I'll recheck it on my Linux machine tonight, but the third loop, simply wrapped in main, should work. If it doesn't, then there's no point in analysing your code further, since there is a problem in the installation itself.