How to Convert a Pandas DataFrame Schema
df = df.astype(str)
will convert all of the data in a pandas DataFrame to strings (object dtype) using the built-in astype() method. You can also change the type of a single column, for example:
df['Column4'] = df['Column4'].astype(str)
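For instance, a quick sanity check of the dtypes before and after (a minimal sketch; the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'Column4': [1, 2, 3], 'Column5': [1.5, 2.5, 3.5]})
print(df.dtypes)     # Column4: int64, Column5: float64

df = df.astype(str)  # every column becomes object (string) dtype
print(df.dtypes)     # Column4: object, Column5: object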
All you need to do is change the type of your DataFrame, or a subset of its columns, before calling parquet_writer.write_table(table). Altogether, your code would look like this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/output.parquet'
chunksize = 100_000

def convert(df):
    # Coerce Column4 to string before handing the chunk to pyarrow
    df['Column4'] = df['Column4'].astype(str)
    return df

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    converted = convert(chunk)
    if i == 0:
        # Infer the Parquet schema from the first converted chunk
        parquet_schema = pa.Table.from_pandas(df=converted).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write the converted chunk to the Parquet file
    table = pa.Table.from_pandas(converted, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
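Once the file is written, you can confirm that the coerced schema was actually applied by reading back just the file metadata (a minimal sketch, reusing the parquet_file path from above):

import pyarrow.parquet as pq

# Reads only the footer metadata, not the data itself
schema = pq.read_schema(parquet_file)
print(schema)  # Column4 should now be reported as string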
Comments
- Hamed almost 2 years: I am reading a CSV file with pandas.read_csv and it detects the schema automatically, which is like:
Column1: string
Column2: string
Column3: string
Column4: int64
Column5: double
Column6: double
__index_level_0__: int64
Then, I am trying to write it with pyarrow.parquet.write_table as a Parquet table. However, I want to use the following schema for the new Parquet file:
Column1: string
Column2: string
Column3: string
Column4: string
Column5: string
Column6: string
__index_level_0__: int64
But I get an error saying "Table schema does not match schema used to create file". Here is the piece of code I have used to convert a CSV file to a Parquet file, borrowed from here:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'C:/input.csv'
parquet_file = 'C:/output.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False, encoding="ISO-8859-1")

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        # parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_schema = pa.schema([
            ('c1', pa.string()),
            ('c2', pa.string()),
            ('c3', pa.string()),
            ('c4', pa.string()),
            ('c5', pa.string()),
            ('c6', pa.string())
        ])
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
- Wes McKinney over 5 years: What is the schema of table right before it is written (where you get the error)?
- Hamed over 5 years: It is the same as the automatically detected schema.
- Wes McKinney over 5 years: OK, table = pa.Table.from_pandas(chunk, schema=parquet_schema) should give you a table with the coerced schema. If not, it's a bug; please open a JIRA issue if possible.
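If from_pandas does not coerce the types directly in your pyarrow version, one possible workaround (an assumption on my part, not something confirmed in this thread) is to cast the Arrow table itself after construction:

import pandas as pd
import pyarrow as pa

# Illustrative data; the column names here are hypothetical
df = pd.DataFrame({'c1': ['a', 'b'], 'c4': [1, 2]})
target_schema = pa.schema([('c1', pa.string()), ('c4', pa.string())])

# Build the table with the inferred types, then let Arrow cast int64 -> string
table = pa.Table.from_pandas(df, preserve_index=False)
table = table.cast(target_schema)
print(table.schema)  # both columns are now string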
- Hamed over 5 years: Thanks, how is it possible to convert only the type of a column in a dataframe?
- G. Anderson over 5 years: Exactly the same way, luckily: df['col'] = df['col'].astype(str)
- Wes McKinney over 5 years: It's good that there's a workaround, but I'm concerned there's a bug in pyarrow that needs to be reported.