Parquet Binary Data type

10,859

Solution 1

Raw bytes are stored in Parquet either as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY) or as a variable-length byte array (BYTE_ARRAY, also called binary). The fixed-length type is used when every value has the same size, such as a SHA-1 hash. Most of the time, the variable-length version is used.
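To make the fixed versus variable distinction concrete, here is a minimal Python sketch that uses only the standard library. It does not call any Parquet library, and the sample payloads are made up for illustration. It only shows that SHA-1 digests always have the same length (a natural fit for a fixed-length byte array of 20 bytes), while the raw payloads differ in size (which would call for the variable-length type).

```python
import hashlib

# Hypothetical sample payloads of different sizes (illustration only).
payloads = [b"", b"abc", b"a much longer raw value"]

# A SHA-1 digest has a constant size regardless of the input size.
digests = [hashlib.sha1(p).digest() for p in payloads]

payload_lengths = [len(p) for p in payloads]  # differs per value
digest_lengths = [len(d) for d in digests]    # identical for every value

print("payload lengths:", payload_lengths)
print("digest lengths: ", digest_lengths)
```

In a schema, the digests would be a candidate for FIXED_LEN_BYTE_ARRAY with length 20, and the payloads for BYTE_ARRAY.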

Strings are encoded as variable-length binary with the UTF8 type annotation, which indicates how to interpret the raw bytes back into a String. UTF8 is the only string encoding supported in the format, but not every binary field uses UTF8, because not all binary fields store string data.
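The role of the UTF8 annotation can be sketched in plain Python (standard library only, no Parquet library involved; the sample values are hypothetical). A string is turned into bytes with UTF-8 and recovered by decoding those bytes, whereas arbitrary binary data is not necessarily valid UTF-8 and should simply stay as bytes.

```python
text = "café"

# What would be stored for a UTF8-annotated binary value: the UTF-8 bytes.
stored = text.encode("utf-8")

# What a reader does when the annotation is present: decode back to a string.
restored = stored.decode("utf-8")

# Binary data without the annotation is just bytes; decoding it may fail.
raw = bytes([0xFF, 0xFE, 0x00, 0x80])
try:
    raw.decode("utf-8")
    raw_is_text = True
except UnicodeDecodeError:
    raw_is_text = False

print(stored, restored, raw_is_text)
```

This is why the annotation is attached per field rather than assumed for every binary value.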

Solution 2

There is no data type called BYTE_ARRAY in parquet-column. I looked through PrimitiveType in the latest package but could not find it. I also could not write a byte[] as binary.

Author by

user1971133

Updated on June 04, 2022

Comments

  • user1971133
    user1971133 almost 2 years

    I have a question regarding the Binary data type. I am trying to write a Parquet schema for my MR job so that the job creates the Parquet file itself, rather than having Hive or Impala create it. I see some references to a Binary type, which I do not see in Parquet.

    Is binary an alias to BYTE_ARRAY?

    Also, is UTF-8 the default encoding for Binary data types?

  • Mathews Sunny
    Mathews Sunny over 5 years
    Provide some more description or a link to more details
  • Mayank Thirani
    Mayank Thirani over 5 years
    Finally, I was able to write byte[] using the binary type and the fixed-length byte array type, both of which are supported by PrimitiveType in Parquet. See github.com/apache/parquet-format, but don't get confused by the BYTE_ARRAY type there. I think what they mean is the FIXED_LEN one, as there is no BYTE_ARRAY specifically. BYTE_ARRAY corresponds to binary in Parquet.