How to read a Parquet file into a Pandas DataFrame?
Solution 1
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).
Solution 2
Update: since the time I answered this there has been a lot of work in this area. Look at Apache Arrow for better reading and writing of Parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create Python objects that you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
Solution 3
Aside from pandas, Apache pyarrow also provides a way to transform Parquet into a DataFrame.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
Solution 4
Parquet
Step 1: Data to play with
import pandas as pd

df = pd.DataFrame({
'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
Solution 5
Considering a .parquet file named data:
parquet_file = '../data.parquet'
Use pandas.DataFrame.to_parquet to write it
(this function requires either the fastparquet or pyarrow library):
parquet_df.to_parquet(parquet_file)
There is no need to open() the file first; to_parquet creates it, and opening an existing Parquet file with mode 'w+' would truncate it.
Then, use pandas.read_parquet() to get a DataFrame:
new_parquet_df = pd.read_parquet(parquet_file)
Daniel Mahler
Updated on July 08, 2022
Comments
-
Daniel Mahler almost 2 years
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
-
mdurant over 8 years
Do you happen to have the data openly available? My branch of python-parquet github.com/martindurant/parquet-python/tree/py3 had a pandas reader in parquet.rparquet, you could try it. There are many parquet constructs it cannot handle.
-
XValidated about 8 years
Wait for the Apache Arrow project that the Pandas author Wes Mckinney is part of. wesmckinney.com/blog/pandas-and-apache-arrow After it is done, users should be able to read Parquet files directly from Pandas.
-
sroecker about 7 years
Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame:
import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas()
-
user48956 almost 7 years
Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this.
-
asmaier almost 7 years
Have a look at github.com/dask/fastparquet . For an introduction see continuum.io/blog/developer-blog/introducing-fastparquet .
-
ogrisel over 6 years
Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: github.com/dask/fastparquet and arrow.apache.org/docs/python/parquet.html
-
Andras Deak -- Слава Україні over 6 years
@ogrisel it's open now
-
MichaelChirico over 6 years
@DanielMahler consider updating the accepted answer
-
bluszcz over 7 years
Actually there is pyarrow which allows both reads / writes: pyarrow.readthedocs.io/en/latest/parquet.html
-
snooze_bear about 7 years
I get a permission denied error when I try to follow your link, @bluszcz -- do you have an alternate?
-
ogrisel over 6 years
parquet-python is much slower than alternatives such as fastparquet and pyarrow: arrow.apache.org/docs/python/parquet.html
-
ogrisel over 6 years
pd.read_parquet is now part of pandas. The other answer should be marked as valid.
-
Catbuilts over 5 years
For most of my data, 'fastparquet' is a bit faster. In case pd.read_parquet() fails with a Snappy error, run conda install python-snappy to install snappy.
-
Seb over 5 years
I found pyarrow to be too difficult to install (both on my local windows machine and on a cloud linux machine). Even after the python-snappy fix, there were additional issues with the compiler as well as the error module 'pyarrow' has no attribute 'compat'. fastparquet had no issues at all.
-
Khan almost 5 years
@Catbuilts You can use gzip if you don't have snappy.
-
wawawa over 3 years
Can 'fastparquet' read a '.snappy.parquet' file?
-
Mark Z. about 3 years
I had the opposite experience vs. @Seb. fastparquet had a bunch of issues, pyarrow was a simple pip install and off I went.