How are bits stored in memory? (In chunks? Can there be bits of multiple sizes stored toghether?)

13,155

Solution 1

You'll be better off experimenting in C and/or assembly, rather than Java. Those languages are lower-level and expose the address space directly.

I used to think that each memory location contains 8, 16, 32 or 64 bits. So 0101 would be stored in an 8 bit machine as 00000101 (sign extended if it was negative). This was all fine and dandy until I wrote a program in java out of curiosity to find out some more inner workings of this system.

All memory locations in x86 systems contain 8 bits (1 byte). If a value contains more data than can fit into a single byte, it is stored using multiple bytes. For example, in C, the "float" type is stored using 4 bytes (32 bits).

All of it looks fine except for the space. It has 6 bits instead of 8. I'm now wondering how all of that information is stored in memory. If all of it was stored in 8 bit chunks, like

The space is also stored in a single byte. Your print code is forgetting to pad out to 8 spaces. 100000 == 00100000 == 0x20.

Solution 2

The space has 8 bits too. It's just that Integer.toBinaryString doesn't print leading 0 bits the way you used it.

With all the leading 0 bits, it actually looks like this in memory:

H : 01001000
e : 01100101
l : 01101100
l : 01101100
o : 01101111
  : 00100000
W : 01010111
o : 01101111
r : 01110010
l : 01101100
d : 01100100

Solution 3

Your original intuition was (mostly) correct: all memory locations consist of the same number of bits. On all modern machines, there are eight bits in a "byte", where a byte is the smallest chunk of memory that the machine can access individually.

Look closely at your output. You have seven digits in all of them except the space. The space just happens to begin with two zeroes in its binary representation, while the other letters begin with one.

Solution 4

Actually your approach is wrong. Encoding is very important here.

If you use ASCII then you can easily say that each character is stored in a byte (eight bits) but when encoding changes you cannot say that.

Eg: UTF-8 uses one to three bytes (8 to 24 bits) for each character on a string. That is why you will see an overload in which you can specify the encoding on inputstream object.

Choosing wrong input stream will absolutely cause a wrong string output. Thus you have to know the encoding of the file to understand which bit means what. Actually fileinputstream does this for you.

If you store a digit as string it will take a char length in hard drive. Just like another character.

However if you store 123456789 as string with ASCII encoding it will take 9*8 bits = 72 bits.

If you store this as integer, (note that integer's data width differs in different environments) it will only take 16 bits.

Also you cannot be sure that

H : 01001000
e : 01100101
l : 01101100
l : 01101100
o : 01101111
  : 00100000
W : 01010111
o : 01101111
r : 01110010
l : 01101100
d : 01100100
\n: 00001010

is stored in hard drive as H : 01001000 e : 01100101 l : 01101100 l : 01101100 o : 01101111 : 00100000 W : 01010111 o : 01101111 r : 01110010 l : 01101100 d : 01100100 \n: 00001010

You cannot be sure of that. File System is not that simple. Maybe Hello is successive but World string is at the end of drive. Thats why there is defrag command.

But if we talk about main memory (RAM) when you define a string i expect bits to be successive. At least in C it is. You define a string like that.

char[100] value; // c is a char array. (there is no string type in c)

here value[0] is the first character of our string. And value only addresses to the char arrays location in memory.

if value[0]'s address is 10 then value[1]'s address is 10+8 = 18.

Solution 5

The way computers store numbers can be compared to an odometer in a car. If the odometer has 4 digits, it stores the number 33 as "0033".

If someone asks you what your mileage is, you aren't going to say "zero thousand zero hundred and thirty three". By default, Java doesn't either. (Although you can tell it to.)

Then wouldn't storing a small number in a large bit space waste a lot of bits?

Well, not really. Suppose you had 11000100 in memory somewhere. How is the computer supposed to know whether this means 11000100, or 11000 followed by 100, or 1 followed by 1000 followed by 100, and so on?

Well, actually the computer is just following the program it is given (remember that a Java program is created partly by you and partly by the people who design Java). If you can create a viable system for saving bits, you can make the computer do it.

However, keep in mind that there's a tradeoff in terms of processor usage and programming difficulty. Since a typical computer can work with bytes much more quickly than it can with say, 7-bit or variable-bit numbers, storing ASCII codes in bytes is a very common choice for storing text.

But let me return to your question.

Then wouldn't storing a small number in a large bit space waste a lot of bits?

Mathematically speaking, no. A branch of mathematics called Information Theory tells us that the number of bits which are absolutely necessary depends on the possibilities you want to encode and how likely each of them is.

Let's suppose you have only a four letter alphabet (A, B, C, D), and use two-bit numbers (00, 01, 10, 11 respectively) to represent it. If each of these letters is equally likely, then the minimum number of bits required per letter (on average) is 2. In other words, there are no wasted bits even though A is 00 and B is 01.

On the other hand, if you use ASCII and encode A, B, C, D as the following 7-bit numbers:

A: 1000001
B: 1000010
C: 1000011
D: 1000100

then you are "wasting" 5 bits per letter (even though you're not "storing small numbers in a large bit space").

These sorts of considerations are important when designing compression algorithms, and not so important for everday applications. It's certainly important to understand bits and bytes if you wish to learn C.

Share:
13,155
Cobalt
Author by

Cobalt

Updated on June 30, 2022

Comments

  • Cobalt
    Cobalt almost 2 years

    I used to think that each memory location contains 8, 16, 32 or 64 bits. So 0101 would be stored in an 8 bit machine as 00000101 (sign extended if it was negative). This was all fine and dandy until I wrote a program in java out of curiosity to find out some more inner workings of this system.

    The method in question looks like this:

    public void printBinaryRep(File f){
            try{
                FileInputStream inputStream = new FileInputStream(f);
                int next = 0;
                byte b = 0;
                while((next = inputStream.read()) != -1){
                    b = (byte)next;
                    System.out.println((char)next + " : "+Integer.toBinaryString(next));
                }
                inputStream.close();
            }
            catch(Exception e){System.out.println(e);}
     }
    

    I got this output from a file that says Hello World

    H : 1001000
    e : 1100101
    l : 1101100
    l : 1101100
    o : 1101111
      : 100000
    W : 1010111
    o : 1101111
    r : 1110010
    l : 1101100
    d : 1100100
    

    All of it looks fine except for the space. It has 6 bits instead of 8. I'm now wondering how all of that information is stored in memory. If all of it was stored in 8 bit chunks, like

    Hello: 10010001100101110110011011001101111

    Then you can simply look at each 8 bit chunk and figure out what number it's representing (and then what ASCII code it's referring to). How does it work when a different sized character (like the 6 bit space and the 4 bit /n ) is stored along with them?? Then wouldn't storing a small number in a large bit space waste a lot of bits?

    I think I have some of the fundamental understanding wrong (or maybe the program's wrong somewhere...). Sorry if the question sounds strange or too un-necessarily in-depth. I just want to know. I've done some googling, but it didn't come up with anything relevent. If you can let me know where I've gone wrong or point me in the right direction, I'd greatly appreciate it. Thanks!

  • Paul Nathan
    Paul Nathan over 14 years
    I completely agree with the C/ASM approach.
  • Cobalt
    Cobalt over 14 years
    Thanks John, That makes more sense! I'm learning C, but I still tend to code for fun in Java since I'm more comfortable with it.
  • David Thornley
    David Thornley over 14 years
    Most modern computers have 8 bits per memory location. In the old days, I worked on systems with 60 bits (CDC supercomputers) and there were systems with 1 (Burroughs).
  • Peter Cordes
    Peter Cordes about 5 years
    Fun fact: Some modern DSPs (like some older computers) only have word-addressable memory. i.e. one address higher advances by 16 or 32 bits. On a word-addressable architecture, a byte is still 8 bits (unless we're talking about ancient systems with e.g. 36-bit words), but you can't load/store one individually. (You need to load the containing word and AND/OR to replace a byte, then store it back.) One advantage to word-addressable memory is that a 16-bit pointer can address word-size times 2^16 bytes of memory, instead of just 2^16. (So twice as many or 4x as many bits.)
  • Peter Cordes
    Peter Cordes about 5 years
    If you're going to talk about fragmentation of the physical data within a logical file, you could also consider that char value[100] might span a virtual page boundary and be split across 2 physical pages. And/or a multi-channel memory controller might interleave DRAM pages or cache lines between DRAM channels, so every other 64 byte cache line might be on separate a physical DIMM. But hardware handles memory mapping, unlike software filesystem drivers. Still, neither are relevant for a program that reads files; a filesystem is a level of indirection that presents logically-contiguous files