How to Convert Pandas Data Frame Schema

11,400

df=df.astype(str) will convert all of the data in a pandas dataframe in strings, with object dtypes using the built-in astype() method

You can also change the type of a single column, for example df['Column4'] = df['Column4'].astype(str).

All you need to do is to change the type of your dataframe or a subset of its columns before parquet_writer.write_table(table). Altogether, your code would look like this.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/putput.parquet'
chunksize = 100_000

def convert(df):
    df['Column4'] = df['Column4'].astype(str)
    return df

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:            
        converted = convert(chunk)
        parquet_schema = pa.Table.from_pandas(df=converted).schema

        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

    # Write CSV chunk to the parquet file
    converted = convert(chunk)
    table = pa.Table.from_pandas(converted, parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
Share:
11,400

Related videos on Youtube

Hamed
Author by

Hamed

I love software development, architecture, and management.

Updated on May 28, 2022

Comments

  • Hamed
    Hamed almost 2 years

    I am reading a CSV file with pandas.read_csv and it detects the schema automatically which is like

    Column1: string
    Column2: string
    Column3: string
    Column4: int64
    Column5: double
    Column6: double
    __index_level_0__: int64
    

    Then, I am trying to write it with pyarrow.parquet.write_table as a Parquet table. However, I want to use the following schema for the new parquet file

    Column1: string
    Column2: string
    Column3: string
    Column4: string
    Column5: string
    Column6: string
    __index_level_0__: int64
    

    But I get an error saying "Table schema does not match schema used to create file". Here is the piece of code I have used to convert a CSV file to a Parquet file borrowed from here

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    csv_file = 'C:/input.csv'
    parquet_file = 'C:/putput.parquet'
    chunksize = 100_000
    
    csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")
    
    for i, chunk in enumerate(csv_stream):
        print("Chunk", i)
        if i == 0:
            # Guess the schema of the CSV file from the first chunk
            # parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_schema = pa.schema([
                ('c1', pa.string()),
                ('c2', pa.string()),
                ('c3', pa.string()),
                ('c4', pa.string()),
                ('c5', pa.string()),
                ('c6', pa.string())
            ])
            # Open a Parquet file for writing
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
        # Write CSV chunk to the parquet file
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)
    
    parquet_writer.close()
    
    • Wes McKinney
      Wes McKinney over 5 years
      What is the schema of table right before it is written (where you get the error)?
    • Hamed
      Hamed over 5 years
      It is the same as the automatically detected schema
    • Wes McKinney
      Wes McKinney over 5 years
      OK, table = pa.Table.from_pandas(chunk, schema=parquet_schema) should give you a table with the coerced schema. If not, it's a bug and please open a JIRA issue if possible
  • Hamed
    Hamed over 5 years
    Thanks, how is it possible to convert only the type of a column in a dataframe?
  • G. Anderson
    G. Anderson over 5 years
    Exactly the same way, luckily df['col'] = df['col'].astype(str)
  • Wes McKinney
    Wes McKinney over 5 years
    It's good that there's a workaround but I'm concerned there's a bug in pyarrow that needs to be reported