Reading in a CSV file as a DataFrame from HDFS

23,702

Solution 1

I know next to nothing about hdfs, but I wonder if the following might work:

import pandas as pd
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

As far as I know, read_csv accepts an open file handle, or more generally any file-like object with a read() method, and not just a path. The numpy text readers go further and accept any iterable that yields lines.
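
A quick way to check that assumption without an HDFS cluster is to hand read_csv an in-memory file object. This is only a sketch: io.StringIO stands in for the handle returned by hd.open, and the column names and values are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the handle returned by hd.open (illustrative data).
fake_handle = io.StringIO("name,score\nalice,1\nbob,2\n")

# read_csv only needs an object with a read() method, not a path.
df = pd.read_csv(fake_handle)
print(df)
```

If this works but the pydoop handle does not, the problem is more likely in how the HDFS file is opened than in pandas.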

pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if the file can be read as a regular local file:

with open("/home/file.csv") as f:
    print(f.read())

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.

Solution 2

You can use the following code to read a CSV file from HDFS:

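import pandas as pd
import pyarrow as pa

# Connection details for the HDFS namenode (placeholders).
hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user",
}

# pa.hdfs.connect is pyarrow's legacy HDFS API.
fs = pa.hdfs.connect(
    hdfs_config["host"], hdfs_config["port"], user=hdfs_config["user"]
)

# Open the remote file and hand the file object to pandas.
with fs.open("/home/file.csv") as f:
    df = pd.read_csv(f)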
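
pa.hdfs.connect belongs to pyarrow's legacy HDFS interface, which newer pyarrow releases deprecate in favour of pyarrow.fs.HadoopFileSystem. A rough equivalent with the newer interface might look like the sketch below. The helper name read_hdfs_csv and its default port are made up for illustration, and running it still needs a reachable cluster and a working local Hadoop client setup.

```python
import pandas as pd


def read_hdfs_csv(path, host, port=8020, user=None):
    """Read a CSV stored on HDFS into a DataFrame (hypothetical helper)."""
    # Imported here so the module loads even where pyarrow is not installed.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host, port=port, user=user)
    # open_input_file returns a file-like object that read_csv can consume.
    with hdfs.open_input_file(path) as f:
        return pd.read_csv(f)


# Example call (placeholder values, needs a live cluster):
# df = read_hdfs_csv("/home/file.csv", host="XXX.XXX.XXX.XXX", user="user")
```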
Author by

lordingtar

Updated on July 20, 2022

Comments

  • lordingtar almost 2 years

    I'm using pydoop to read in a file from hdfs, and when I use:

    import pydoop.hdfs as hd
    with hd.open("/home/file.csv") as f:
        print(f.read())
    

    It shows me the file in stdout.

    Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:

    >>> import pandas as pd
    >>> pd.read_csv("/home/file.csv")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
        self._make_engine(self.engine)
      File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
        self._reader = _parser.TextReader(src, **kwds)
      File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
      File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
    IOError: File /home/file.csv does not exist