Using pyarrow, how do you append to a parquet file?


Solution 1

I ran into the same issue and I think I was able to solve it using the following:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


chunksize = 10000  # number of rows per CSV chunk

pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)            
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
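
For what it's worth, pq.ParquetWriter can also be used as a context manager, which guarantees the footer is written even if the loop raises. Below is a minimal sketch of the same chunked-CSV loop in that style ('sample.csv' and the chunk size are placeholders, and compression='gzip' is the optional setting mentioned in the comments):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # rows per chunk; tune for your data

# Read the first chunk to establish the schema, then stream the rest.
csv_stream = pd.read_csv('sample.csv', chunksize=chunksize)
first_chunk = next(csv_stream)
first_table = pa.Table.from_pandas(first_chunk)

# The context manager guarantees close() runs, so the parquet footer
# is written even if an exception occurs mid-loop.
with pq.ParquetWriter('sample.parquet', first_table.schema,
                      compression='gzip') as writer:
    writer.write_table(first_table)
    for df in csv_stream:
        writer.write_table(pa.Table.from_pandas(df))

Note this assumes every chunk yields the same inferred schema; if pandas infers different dtypes for a later chunk, write_table will raise, and you would need to pass an explicit schema or cast each table first.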

Solution 2

In your case the column names are not consistent across the dataframes. I made the column names consistent for three sample dataframes, and the following code worked for me.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Method writes/append dataframes in parquet format.

    This method is used to write pandas DataFrame as pyarrow Table in parquet format. If the methods is invoked
    with writer, it appends dataframe to the already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed in the subsequenct method calls to append DataFrame
        in the pyarrow Table
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':

    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)

Output:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

Solution 3

Generally speaking, Parquet datasets consist of multiple files, so you append by writing an additional file into the same directory where the data belongs. It would be useful to have the ability to concatenate multiple files easily. I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this possible to do easily in C++ (and therefore Python).
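
A minimal sketch of that directory-based approach (the /tmp/dataset path and the part-file naming are just illustrative): each "append" writes one more complete parquet file, and readers treat the directory as a single dataset.

import os
import uuid

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataset_dir = '/tmp/dataset'  # illustrative dataset directory
os.makedirs(dataset_dir, exist_ok=True)

def append_dataframe(df):
    # "Append" by writing one more parquet file into the dataset directory.
    table = pa.Table.from_pandas(df)
    part = os.path.join(dataset_dir, 'part-{}.parquet'.format(uuid.uuid4().hex))
    pq.write_table(table, part)

append_dataframe(pd.DataFrame({'one': [-1.0], 'two': ['foo'], 'three': [True]}))
append_dataframe(pd.DataFrame({'one': [2.5], 'two': ['baz'], 'three': [False]}))

# Reading the directory concatenates every part file into one table.
print(pq.ParquetDataset(dataset_dir).read().to_pandas())

Because each part file is complete and valid on its own, there is no long-lived writer to keep open, which sidesteps the footer problem raised in the comments below.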


Comments

  • Merlin
    Merlin almost 2 years

    How do you append/update to a parquet file with pyarrow?

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

    pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
    # append pqTest2 here?

    I found nothing in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?

  • Merlin
    Merlin over 6 years
    Please include updating data. Maybe there is something in Arrow that might work.
  • Wes McKinney
    Wes McKinney over 6 years
    Please come to the mailing lists for Arrow and Parquet with your questions. Stack Overflow is not the best venue for getting support.
  • Yury Kirienko
    Yury Kirienko over 6 years
    Of course, it depends on the data, but in my experience chunksize=10000 is too big. Chunk size values of about a hundred work much faster for me in most cases.
  • natbusa
    natbusa almost 5 years
    Is the parquet-tools command parquet-merge not an option, at least from the command line? (Disclaimer: I haven't tried it yet.)
  • hodisr
    hodisr almost 5 years
    The else after the if is unnecessary since you're writing to table in both cases.
  • xiaodai
    xiaodai over 4 years
    The parquet file sometimes appears as a single file on Windows. How do I view it as a folder on Windows?
  • Sergio Lucero
    Sergio Lucero about 4 years
    Worked wonders for me. I added compression='gzip' when creating pqwriter.
  • HCSF
    HCSF about 4 years
    Is there a way to skip converting to pandas.DataFrame before converting it into Arrow.Table? Thanks.
  • Ibraheem Ibraheem
    Ibraheem Ibraheem about 4 years
    Well, according to the docs, pyarrow.Table can be created with from_arrays, from_batches, or from_pandas; see arrow.apache.org/docs/python/generated/… (a sketch follows these comments).
  • Michele Piccolini
    Michele Piccolini almost 4 years
    Thanks! To this date, the API for incrementally writing parquet files is really not well documented.
  • Michele Piccolini
    Michele Piccolini almost 4 years
    @YuryKirienko I get the best performance with chunksize=1e5. The best advice for people would be: benchmark with different values and see what's best for you.
  • natbusa
    natbusa over 2 years
    This solution works only while the writer is still open... A better way is to put the files in a directory; pandas/pyarrow will concatenate both files into one dataframe while reading the directory.
  • Contango
    Contango over 2 years
    Unfortunately, this cannot append to an existing .parquet file (see my answer that can). Reason: Once .close() is called, the file cannot be appended to, and before .close() is called, the .parquet file is not valid (will throw an exception due to a corrupted file as it's missing its binary footer). The answer from @Contango solves this.
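
On the question above about skipping pandas entirely: pyarrow can build a Table straight from Python data (for example Table.from_pydict or Table.from_arrays), which then feeds ParquetWriter exactly like a table built via from_pandas. A minimal sketch, with the column names and output path made up for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a Table directly from Python lists; no DataFrame involved.
table = pa.Table.from_pydict({
    'one': [-1.0, None, 2.5],
    'two': ['foo', 'bar', 'baz'],
    'three': [True, False, True],
})

# The resulting table is written the same way as in the solutions above.
with pq.ParquetWriter('/tmp/no_pandas.parquet', table.schema) as writer:
    writer.write_table(table)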