System where 1 byte != 8 bits?


Solution 1

On older machines, character codes smaller than 8 bits were fairly common, but most of those machines have been dead and gone for years now.

C and C++ have mandated a minimum of 8 bits for char, at least as far back as the C89 standard. [Edit: For example, C90, §5.2.4.2.1 requires CHAR_BIT >= 8 and UCHAR_MAX >= 255. C89 uses a different section number (I believe that would be §2.2.4.2.1) but identical content]. They treat "char" and "byte" as essentially synonymous [Edit: for example, CHAR_BIT is described as: "number of bits for the smallest object that is not a bitfield (byte)".]
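For reference, here is a minimal snippet (my own illustration, not taken from the standard) that prints the values the standard constrains; §5.2.4.2.1 only sets lower bounds, so larger values are still conforming:

```cpp
// Print this implementation's byte width and unsigned char maximum.
// The standard only guarantees CHAR_BIT >= 8 and UCHAR_MAX >= 255.
#include <climits>
#include <cstdio>

int main() {
    std::printf("CHAR_BIT     = %d\n", CHAR_BIT);
    std::printf("UCHAR_MAX    = %u\n", static_cast<unsigned>(UCHAR_MAX));
    std::printf("sizeof(char) = %zu\n", sizeof(char)); // always 1, regardless of CHAR_BIT
    return 0;
}
```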

There are, however, current machines (mostly DSPs) where the smallest type is larger than 8 bits -- a minimum of 12, 14, or even 16 bits is fairly common. Windows CE does roughly the same: its smallest type (at least with Microsoft's compiler) is 16 bits. They do not, however, treat a char as 16 bits -- instead they take the (non-conforming) approach of simply not supporting a type named char at all.

Solution 2

TODAY, in the world of C++ on x86 processors, it is pretty safe to rely on one byte being 8 bits. Processors where the word size is not a power of 2 (8, 16, 32, 64) are very uncommon.

IT WAS NOT ALWAYS SO.

The Control Data 6600 (and its brothers) Central Processor used a 60-bit word, and could only address a word at a time. In one sense, a "byte" on a CDC 6600 was 60 bits.

The DEC-10 byte pointer hardware worked with arbitrary-size bytes: the byte pointer included the byte size in bits. (The DEC-10 used a 36-bit word.) I don't remember whether bytes could span word boundaries; I think they couldn't, which meant that you'd have a few wasted bits per word unless the byte size evenly divided 36 (for example, 6, 9, 12, or 18 bits).

Solution 3

Unless you're writing code that could be useful on a DSP, you're completely entitled to assume bytes are 8 bits. All the world may not be a VAX (or an Intel), but all the world has to communicate, share data, establish common protocols, and so on. We live in the internet age built on protocols built on octets, and any C implementation where bytes are not octets is going to have a really hard time using those protocols.

It's also worth noting that both POSIX and Windows have (and mandate) 8-bit bytes. That covers 100% of interesting non-embedded machines, and these days a large portion of non-DSP embedded systems as well.
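To make the octet point concrete, here is a hedged sketch (the helper name and layout are mine, not from any particular networking library) of writing a 32-bit value as big-endian octets, one per unsigned char. Because it only ever touches the low 8 bits of each element, it behaves the same whether CHAR_BIT is 8 or, say, 16 on a DSP:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: store a 32-bit value big-endian, one octet per
// unsigned char element. On a CHAR_BIT == 16 machine the high bits of
// each element simply stay zero; the wire format is still four octets.
void put_u32_be(std::uint_least32_t value, unsigned char* out) {
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = static_cast<unsigned char>((value >> (8 * (3 - i))) & 0xFFu);
    }
}
```

The sketch uses uint_least32_t rather than uint32_t because the exact-width types are optional and cannot exist at all when their width is not a multiple of CHAR_BIT (see the uint8_t discussion in the comments below).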

Solution 4

From Wikipedia:

The size of a byte was at first selected to be a multiple of existing teletypewriter codes, particularly the 6-bit codes used by the U.S. Army (Fieldata) and Navy. In 1963, to end the use of incompatible teleprinter codes by different branches of the U.S. government, ASCII, a 7-bit code, was adopted as a Federal Information Processing Standard, making 6-bit bytes commercially obsolete. In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit µ-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data "octets" to be adopted as the basic data unit of the early Internet.

Solution 5

As an average programmer on mainstream platforms, you do not need to worry too much about one byte not being 8 bits. However, I'd still use the CHAR_BIT constant in my code and assert (or better, static_assert) at any location where you rely on 8-bit bytes. That should put you on the safe side.

(I am not aware of any relevant platform where it doesn't hold true).
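For example, a minimal sketch of both habits (the helper name is mine, purely for illustration): fail the build where the 8-bit assumption actually matters, and use CHAR_BIT wherever you do the bits-to-bytes arithmetic yourself:

```cpp
#include <climits>
#include <cstddef>

// Fail at compile time, rather than misbehave at run time, on any
// platform where the 8-bit-byte assumption does not hold.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper: how many bytes are needed to hold `bits` bits,
// written with CHAR_BIT instead of a hard-coded 8.
constexpr std::size_t bytes_for_bits(std::size_t bits) {
    return (bits + CHAR_BIT - 1) / CHAR_BIT;
}

static_assert(bytes_for_bits(12) == 2, "12 bits need two 8-bit bytes");
```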

Comments

  • Xeo
    Xeo almost 4 years

    All the time I read sentences like

    don't rely on 1 byte being 8 bit in size

    use CHAR_BIT instead of 8 as a constant to convert between bits and bytes

    et cetera. What real-life systems are there today where this holds true? (I'm not sure if there are differences between C and C++ regarding this, or if it's actually language agnostic. Please retag if necessary.)

  • Fred Foo
    Fred Foo about 13 years
    Besides being safe, CHAR_BIT is self-documenting. And I learned on SO that some embedded platforms apparently have 16-bit char.
  • Jerry Coffin
    Jerry Coffin about 13 years
    Strings on the CDC were normally stored 10 bit characters to the word though, so it's much more reasonable to treat it as having a 6-bit byte (with strings normally allocated in 10-byte chunks). Of course, from a viewpoint of C or C++, a 6-bit byte isn't allowed though, so you'd have had to double them up and use a 12-bit word as "byte" (which would still work reasonably well -- the PPUs were 12-bit processors, and communication between the CPU and PPUs was done in 12-bit chunks.
  • John R. Strohm
    John R. Strohm about 13 years
    When I was doing 6600, during my undergrad days, characters were still only 6 bits. PASCAL programmers had to be aware of the 12-bit PP word size, though, because end-of-line only occurred at 12-bit boundaries. This meant that there might or might not be a blank after the last non-blank character in the line, and I'm getting a headache just thinking about it, over 30 years later.
  • John R. Strohm
    John R. Strohm about 13 years
    Yes, and there are still a few 24-bit DSPs around.
  • Xeo
    Xeo about 13 years
    I'll accept this answer because it puts everything important into one place. Maybe also add that bit from larsmans comment that CHAR_BIT is also self-documenting, which also made me use it now. I like self-documenting code. :) Thanks everyone for their answers.
  • Nawaz
    Nawaz about 13 years
    Could you please quote where in the C89 Standard it says char must be a minimum of 8 bits?
  • Scott C Wilson
    Scott C Wilson about 13 years
    Holy cow what a blast from the past! +1 for the memories!
  • David Hammen
    David Hammen about 13 years
    @Nawaz: I don't have C89 handy, but C99 section 5.2.4.2.1 says regarding the values in <limits.h> that "implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign." -- and then says that CHAR_BIT is 8. In other words, larger values are compliant, smaller ones are not.
  • Nawaz
    Nawaz about 13 years
    @David: Hmm, I had read that, but didn't interpret it that way. Thanks for making me understand that.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE almost 13 years
    Wow +1 for teaching me something new about how broken WinCE is...
  • curiousguy
    curiousguy over 12 years
    What real problems have you seen when a machine where a byte is not an octet communicates on the Internet?
  • Wiz
    Wiz over 12 years
    I don't think that networking will be much of a problem. The very low level calls should be taking care of the encoding details. Any normal program should not be affected.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 12 years
    They can't. getc and putc have to preserve unsigned char values round-trip, which means you can't just have "extra bits" in char that don't get read/written.
  • Andreas Spindler
    Andreas Spindler over 11 years
    They probably use C99's uint8_t. Since POSIX requires it for all socket functions it should be available (although C99 does not require uint8_t to be defined).
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    uint8_t cannot exist if char is larger than 8 bits, because then uint8_t would have padding bits, which are not allowed.
  • A. H.
    A. H. over 10 years
    there is something seriously wrong with Windows CE
  • Barmar
    Barmar over 10 years
    While you can technically do anything you want when implementing a compiler, in a practical sense you need to conform to the operating system's ABI, and this generally forces all compilers for a particular system to use the same data representations.
  • AnT stands with Russia
    AnT stands with Russia over 10 years
    @Barmar: The need to conform to the operating system's ABI applies to interface data formats only. It does not impose any limitations on the internal data formats of the implementation. Conformance can be (and typically is) achieved by using properly selected (and possibly non-standard) types to describe the interface. For example, the boolean type of the Windows API (hiding behind BOOL) is different from bool of C++ or C. That does not create any problems for implementations.
  • Barmar
    Barmar over 10 years
    Many APIs and ABIs are specified in terms of standard C data types, rather than abstract types. POSIX has some abstract types (e.g. size_t), but makes pretty liberal use of char and int as well. The ABI for particular POSIX implementations must then specify how these are represented so that interfaces will be compatible across implementations (you aren't required to compile applications with the same implementation as the OS).
  • AnT stands with Russia
    AnT stands with Russia over 10 years
    @Barmar: That is purely superficial. It is not possible to specify ABIs in terms of truly standard language-level types. Standard types are flexible by definition, while ABI interface types are frozen. If some ABI uses standard type names in its specification, it implies (and usually explicitly states) that these types are required to have some specific frozen representation. Writing header files in terms of standard types for such ABIs will only work for those specific implementations that adhere to the required data format.
  • AnT stands with Russia
    AnT stands with Russia over 10 years
    Note that for the actual implementation "ABI in terms of standard types" will simply mean that some header files are written in terms of standard types. However, this does not in any way preclude the implementation from changing the representation of standard types. The implementation just has to remember that those header files have to be rewritten in terms of some other types (standard or not) to preserve binary compatibility.
  • AnT stands with Russia
    AnT stands with Russia over 10 years
    For example, today I specify some ABI in terms of type int and presume (and explicitly state) that in this ABI int has 32 bits. Tomorrow my compiler gets significantly upgraded and its int changes from 32 bits to 64 bits. To preserve binary compatibility all I have to do in this case is replace int with int32_t in that ABI's header files. I don't even have to change the documentation of the ABI, since it explicitly states that it expects 32-bit int.
  • atzz
    atzz over 10 years
    @Jerry, you sure about char and WinCE? I wrote a bit for WinCE 5.0 /x86 and /ARM; there was nothing wrong with char type. What they did is remove char-sized versions of Win32 API (so GetWindowTextW is there but GetWindowTextA is not etc.)
  • Jerry Coffin
    Jerry Coffin over 10 years
    @atzz: Availability (or lack of it) of char obviously depends on the compiler, not the OS itself. I (at least think I) remember one of the early compilers for CE lacking char, but it's been quite a while since I wrote any code for CE, so I can't really comment on anything current (or close to it).
  • N8allan
    N8allan about 10 years
    I realize that CHAR_BIT is meant to represent the byte size, but the beef I have with that term is that it really has less to do with chars and more to do with byte length. A newbie dev will likely read CHAR_BIT and think it has something to do with using UTF8 or something like that. It's an unfortunate piece of legacy IMO.
  • SamB
    SamB over 9 years
    @AndreyT: not in C++, you can't ...
  • AnT stands with Russia
    AnT stands with Russia over 9 years
    @SamB: Huh? I "can't" what exactly?
  • jforberg
    jforberg almost 9 years
    This is not an answer to the question, just a vaguely related historical note.
  • jfs
    jfs about 8 years
    @R..: §7.20.1.1.2 (C11) says explicitly that there are no padding bits in uintN_t. §7.20.1.1.3 says "these types are optional." §3.6 defines byte as: "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment" (I don't see the word "smallest" in the definition). There is a notion of internal vs. trailing padding. Can uint8_t have trailing padding? Is there a requirement that a uint8_t object is at least CHAR_BIT bits wide (as it is with the _Bool type)?
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 8 years
    @J.F.Sebastian: I have no idea where your notion of "trailing padding" came from or what it would mean. Per Representation of Types all objects have a representation which is an overlaid array unsigned char[sizeof(T)] which may consist partly of padding.
  • jfs
    jfs about 8 years
    @R.. Have you tried to look up the phrase "trailing padding" in the current C standard? (I did it using n1570 draft)
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 8 years
    @J.F.Sebastian: Yes. It's only used in relation to structures; it means padding after any of the members that exists as part of bringing the total struct size up to the total size it needs to be (usually just enough to increase the size to a multiple of its alignment). It has nothing to do with non-aggregate types.
  • too honest for this site
    too honest for this site over 6 years
    "TODAY, in the world of C++ on x86 processors" - You might want to talk to TI, Analog Devices (which have 16 bit DSPs), Freescale/NXP (24 bit DSPs), ARM, MIPS (both not x86), etc. In fact x86 is a minority of architectures and devices sold. But yes, a binary digital computer hardly has **trinary**(/etc.) digits.
  • mtraceur
    mtraceur about 6 years
    @R.. One thing I don't get about your "they can't [communicate on the internet]" comment is that you reference getc and putc, but are those strongly relevant to the question of accessing the internet? Doesn't almost everything in the world access the internet through interfaces outside of the standard C library? Last I checked, you couldn't even get a stdio.h compatible object pointing to a network connection without first going through system-specific interfaces, could you? So is there any reason why details of getc/etc would preclude access to the internet?
  • mtraceur
    mtraceur about 6 years
    @R.. I also see no conceptual problem with a hypothetical system avoiding dropping higher bits from a char, so long as at the lowest level interface they either require data reads/writes in multiples of 8-bits at a time (2 12-bit bytes would give you a clean 3-octet boundary, 8 9-bit bytes would give you a clean 9-octet boundary, etc) or buffered accordingly. I suspect such systems do not exist due to lack of demand, but is there any reason why it'd be an impossibility?
  • mtraceur
    mtraceur about 6 years
    Lest my comments give the opposite impression, though, +1 to this answer, because it explains why it's reasonable to expect an 8-bit byte in all of the circumstances when it is, which really helps answering the fundamental question of "when should I worry about bytes/char being anything other than an octet" in a way that mere examples can't.