Saving data to a binary file

Solution 1

All files contain only ones and zeroes; on binary computers, that's all there is to play with.

When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.

So for text, the choice between a text file and a binary file makes almost no difference; the savings in space that you've heard about generally come into play for other data types.

Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period: 14 characters in total. If saved in binary as just a copy of the float's bits, it will take four bytes (32 bits) on a typical system where float is 32 bits wide. The exact number of bits stored by a call such as:

FILE *my_file = fopen("pi.bin", "wb");
float x = 3.1415f;
if (my_file != NULL) {
    fwrite(&x, sizeof x, 1, my_file);
    fclose(my_file);
}

is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
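To make the size comparison concrete, here is a minimal sketch. The helper names text_size and float_bits are made up for this illustration; they only measure sizes and do not write any file.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stdio.h>   /* snprintf */

/* Number of characters needed to print x as decimal text with the
   given number of digits after the decimal point. */
size_t text_size(double x, int decimals)
{
    /* With a NULL buffer and size 0, snprintf only reports the length. */
    int n = snprintf(NULL, 0, "%.*f", decimals, x);
    return n < 0 ? 0 : (size_t)n;
}

/* Number of bits that fwrite(&x, sizeof x, 1, f) stores for a float x. */
size_t float_bits(void)
{
    return CHAR_BIT * sizeof(float);
}
```

text_size(3.141592653589, 12) gives 14 characters, against the four bytes of a typical float. Keep in mind that a 32-bit float only holds about seven significant decimal digits, so the binary copy is smaller partly because it stores less precision; a double (typically eight bytes) is the fairer comparison for a number with this many digits.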

Solution 2

The problem you describe is a chain of (very common [1], unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on. Hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week or two, practice, see what happens :)

There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on here. Joel Spolsky (one of the founders of Stack Overflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.

After that, let us proceed:

When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. The "type" of a variable only means something to the compiler, not to the running program, which just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).

Every compiler is free to use whatever encoding it wishes to represent characters in memory. As a consequence, it is free to represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler: I can agree with myself that every time I want to store a "newline" internally I store it as number six (6), which is just 0x6 in hexadecimal (or 110 in binary).

Writing to a file is done by telling the operating system [2] four things at the same time:

  • The fact that you want to write to a file (fwrite())
  • Where the data starts that you want to write (first argument to fwrite)
  • How much data you want to write (second and third arguments, multiplied)
  • What file you want to write to (last argument)
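The four items above map onto a single fwrite() call. The sketch below wraps that call in a small helper; write_bytes is a name invented for this example, not a standard function.

```c
#include <stdio.h>

/* Hypothetical helper: copy `size` bytes starting at `data` into the
   file at `path`, and return how many bytes were actually written. */
size_t write_bytes(const char *path, const void *data, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return 0;

    size_t written = fwrite(data,  /* where the data starts        */
                            1,     /* size of one item, in bytes   */
                            size,  /* number of items              */
                            f);    /* which file to write to       */

    if (fclose(f) != 0)
        return 0;
    return written;
}
```

Note that the helper takes a const void *: it works the same for a char array, an int, or a struct, because nothing in the call says what the bytes mean.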

Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets and it does not care: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.

Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-to-one to the file. If you write a memory location that holds variables that the compiler decided to store as type "char", those values are written one-to-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.

Opening a file in "text" mode is almost equal to binary mode, with one (and only one) difference: anytime a value is written that is equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On Windows, this is two bytes (13 and 10, or 0x0d 0x0a). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
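You can observe this difference yourself with a sketch along these lines. The function name written_size is invented for this example; it writes a string using the mode you pass in and then counts the bytes that really ended up in the file.

```c
#include <stdio.h>
#include <string.h>

/* Write `text` to `path` using the given fopen mode ("w" or "wb") and
   return the resulting file size in bytes, or -1 on error. */
long written_size(const char *path, const char *mode, const char *text)
{
    FILE *f = fopen(path, mode);
    if (f == NULL)
        return -1;
    fwrite(text, 1, strlen(text), f);
    if (fclose(f) != 0)
        return -1;

    /* Reopen in binary mode so no translation hides the real bytes. */
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long size = 0;
    while (fgetc(f) != EOF)
        size++;
    fclose(f);
    return size;
}
```

With the text "a\nb\n", mode "wb" should give 4 bytes everywhere. Mode "w" gives 4 bytes on systems where a newline is a single byte (Linux, macOS) and 6 bytes on Windows, where each newline becomes 0x0d 0x0a. With "Test", which contains no newline, both modes give the same file.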

Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way that we absolutely do not intend.

(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
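If you want to see which numbers your own compiler uses, a small sketch like this one will do. The helper name char_codes is made up for this example.

```c
#include <stddef.h>

/* Store the numeric value of each character of s in codes[] (at most
   max entries) and return how many values were stored. */
size_t char_codes(const char *s, int codes[], size_t max)
{
    size_t n = 0;
    while (s[n] != '\0' && n < max) {
        /* Going through unsigned char avoids negative values on
           platforms where plain char is signed. */
        codes[n] = (unsigned char)s[n];
        n++;
    }
    return n;
}
```

On a compiler that uses US-ASCII (or a superset of it), "test" comes out as 116, 101, 115, 116. A compiler using a different encoding would give different numbers for the very same source text.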

OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!

And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.

Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT as a "1" or "0" to the user (there are multiple bits in one byte; to extract them you need to know 'bitwise' operator tricks, google!). Note that "1" is the CHARACTER 1, the point in the character set of your choosing, so your program must take a bit (number 1 or 0) and transform it to the sequence of bits needed to represent character 1 or 0 in the encoding used by the terminal emulator on which you are viewing the program's standard output (oh my God). Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
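If you get stuck on the bitwise part, here is one possible sketch of the core step, converting a single byte into printable '0' and '1' characters. The name byte_to_bits is invented for this example, and it takes the short-cut of letting the compiler pick the encoding of '0' and '1'.

```c
#include <limits.h>  /* CHAR_BIT */

/* Write the bits of one byte into out[] as the characters '0' and '1',
   most significant bit first, followed by a terminating '\0'.
   out must have room for CHAR_BIT + 1 characters. */
void byte_to_bits(unsigned char byte, char out[CHAR_BIT + 1])
{
    for (int i = 0; i < CHAR_BIT; i++) {
        /* Shift the wanted bit down to position 0, mask the rest away. */
        int bit = (byte >> (CHAR_BIT - 1 - i)) & 1;
        out[i] = bit ? '1' : '0';
    }
    out[CHAR_BIT] = '\0';
}
```

To finish the exercise, open the file with "rb", read it byte by byte with fgetc(), and print the result of byte_to_bits() for each byte. With 8-bit bytes and US-ASCII, the 'T' of "Test" (value 0x54) shows up as 01010100.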

This stuff is really daunting for newbies, and I know that it took me a long time to even know that there was a difference between a character set and an encoding, let alone how all of this worked. Hopefully I did not demotivate you. If I did, just remember that you can never lose knowledge you already have, only gain it (ok not always true :P). It is normal in life that a statement raises more questions than it answers; Socrates knew this, and his wisdom seamlessly applies to modern day technology 2.4k years later.

Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.

Hraban

[1] The person who told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless they were referring to compressing the data before you save it, in which case they just used a confusing word ("binary") for "compressed".

[2] "Telling the operating system something" is what is commonly known as a system call.

Solution 3

Well, the difference between text mode and binary mode is the way the end of line is handled. If you write a string in binary mode, it will stay the same string.

If you want to make it smaller, you'll have to somehow compress it (look for libz for example).

What is smaller is this: when you want to save binary data (like an array of bytes), it's smaller to save it as raw binary rather than putting it in a string (in either hexadecimal or base64 representation). I hope this helps.
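As a rough illustration of that overhead, here is a sketch of a hexadecimal encoder. The name to_hex is made up for this example.

```c
#include <stdio.h>

/* Encode n bytes as lowercase hexadecimal text. out needs room for
   2 * n + 1 characters. Returns the number of characters written,
   not counting the terminating '\0'. */
size_t to_hex(const unsigned char *data, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        sprintf(out + 2 * i, "%02x", (unsigned)data[i]);
    out[2 * n] = '\0';
    return 2 * n;
}
```

Every byte becomes two characters, so the hexadecimal form is exactly twice the size of the raw data. Base64 is more compact, using four characters for every three bytes, but it is still larger than writing the bytes directly.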

Solution 4

I think you're a bit confused here.

The ASCII-string "Test" will still be an ASCII-string when you write it to the file (even in binary mode). The cases when it makes sense to write binary are for other types than chars (e.g. an array of integers).
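For example, an array of integers can be written and read back as raw bytes roughly like this. The helper names save_ints and load_ints are invented for this sketch.

```c
#include <stdio.h>

/* Write n ints to path as raw bytes; returns the number of ints written. */
size_t save_ints(const char *path, const int *values, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return 0;
    size_t written = fwrite(values, sizeof values[0], n, f);
    if (fclose(f) != 0)
        return 0;
    return written;
}

/* Read up to n ints from path; returns the number of ints read. */
size_t load_ints(const char *path, int *values, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    size_t got = fread(values, sizeof values[0], n, f);
    fclose(f);
    return got;
}
```

Each int takes sizeof(int) bytes in the file no matter how many decimal digits it has, so 1000000 costs the same as 1. The trade-off is that the file is only guaranteed to be readable by a program built with the same int size and byte order.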

Author by

Datoxalas

I am a happy programmer!

Updated on June 24, 2020

Comments

  • Datoxalas
    Datoxalas almost 4 years

    I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.

    Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and NUL bytes at the end. I would expect to see only zeros and ones inside the file.

    Any explanation or suggestions are highly appreciated.

    Here is my code

    #include <iostream>
    #include <stdio.h>
    #include <cstdlib>  /* std::system */
    
    int main()
    {
         /*Temporary data buffer*/
         char buffer[20];
    
         /*Data to be stored in file*/
         char temp[20]="Test";
    
         /*Opening file for writing in binary mode*/
         FILE *handleWrite=fopen("test.bin","wb");
    
         /*Writing data to file*/
         fwrite(temp, 1, 13, handleWrite);
    
         /*Closing File*/
         fclose(handleWrite);
    
        /*Opening file for reading*/
        FILE *handleRead=fopen("test.bin","rb");
    
        /*Reading data from file into temporary buffer*/
        fread(buffer,1,13,handleRead);
    
        /*Displaying content of file on console*/
        printf("%s",buffer);
    
        /*Closing File*/
        fclose(handleRead);
        std::system("pause");
    
        return 0;
    }
    
  • Datoxalas
    Datoxalas about 13 years
    I am getting the same result.
  • Datoxalas
    Datoxalas about 13 years
    Thanks for your explanation.
  • Datoxalas
    Datoxalas about 13 years
    Still getting the same results.
  • Datoxalas
    Datoxalas about 13 years
    I'm not asking anything about printf() here.
  • Datoxalas
    Datoxalas about 13 years
    Do you mean zlib or libz?
  • Bruce
    Bruce about 13 years
    I was talking about zlib, sorry for quick typing :-)
  • Datoxalas
    Datoxalas about 13 years
    Could you help me a little bit on that please? stackoverflow.com/questions/5649030/working-with-zlib