How does MySQL varchar know how many bytes indicate the length?

It is decided at the time the column is defined. All length prefixes for a particular VARCHAR column are the same size in bytes: the column uses either 1 byte or 2 bytes for every value, depending on the declared maximum length in characters and the character set.

Any VARCHAR column defined such that a value might require more than 255 bytes uses 2 bytes to store the length. MySQL never uses 1 byte for some values in a column and 2 bytes for others.

The MySQL documentation on CHAR and VARCHAR Types states this pretty clearly:

A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.

If you declare a VARCHAR(255) column to use the utf8 character set, it still uses 2 bytes for the length prefix, not 1, because the length in bytes may exceed 255: utf8 needs up to 3 bytes per character, so a 255-character value can occupy up to 765 bytes.
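
To make the rule concrete, here is a minimal sketch in Python (illustrative only, not MySQL's storage code) of how the prefix width follows from the declared length and the character set's maximum bytes per character; the per-character maximums used below are assumptions for latin1, the legacy utf8 (utf8mb3), and utf8mb4:

    # Assumed maximum bytes per character: latin1 = 1, legacy utf8 (utf8mb3) = 3, utf8mb4 = 4.
    MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

    def length_prefix_bytes(declared_chars, charset):
        """Return 1 if every possible value fits in at most 255 bytes, else 2."""
        max_value_bytes = declared_chars * MAX_BYTES_PER_CHAR[charset]
        return 1 if max_value_bytes <= 255 else 2

    print(length_prefix_bytes(255, "latin1"))   # 1 -> at most 255 bytes per value
    print(length_prefix_bytes(256, "latin1"))   # 2 -> values may exceed 255 bytes
    print(length_prefix_bytes(255, "utf8"))     # 2 -> up to 765 bytes per value
    print(length_prefix_bytes(85, "utf8"))      # 1 -> at most 255 bytes per value

The decision is made once per column, so every stored value in that column uses the same prefix width.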


Updated on September 28, 2022

Comments

  • Timo, over 1 year ago:

    The MySQL manual and several StackOverflow answers make it clear that varchar uses:

    • 1 byte for varchars with 0-255 characters in them
    • 2 bytes for varchars with more than 255 characters in them.

    The first part makes sense. A single byte can store 256 different values, i.e. 0 through 255.

    What I am trying to figure out is how MySQL knows how many bytes indicate the length.

    Imagine a 255-char varchar starting with the following bytes: [255][w][o][r][d]~

    According to the manual, only the first byte is used to indicate the length in this scenario. When reading the field, MySQL will somehow have to know that this is the case here, and that the second byte is not part of the length.

    Now imagine a 256-char varchar starting with the following bytes: [255][1][w][o][r][d]~

    Now MySQL miraculously knows that it should interpret the first two bytes as the length when reading the field.

    How does it distinguish between the two? The only foolproof way I have come up with is to interpret only the first byte as the length, then check whether the text length matches (in its current encoding), and if not, conclude that the first two bytes must be the length.
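
As the answer above explains, there is nothing to distinguish at read time: the prefix width is fixed per column at definition time, so the reader knows before touching the data whether to consume 1 or 2 length bytes. A minimal Python sketch of that idea (not MySQL's actual row format; the little-endian byte order here is an assumption for illustration):

    def read_varchar(buffer, prefix_bytes):
        """Read one length-prefixed value, given the column's fixed prefix width."""
        length = int.from_bytes(buffer[:prefix_bytes], "little")
        return buffer[prefix_bytes:prefix_bytes + length]

    # Column small enough for a 1-byte prefix: [5][w][o][r][d][s]...
    print(read_varchar(b"\x05words and more", prefix_bytes=1))      # b'words'

    # Column wide enough to need a 2-byte prefix: [5][0][w][o][r][d][s]...
    print(read_varchar(b"\x05\x00words and more", prefix_bytes=2))  # b'words'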