Reading in csv file as dataframe from hdfs
Solution 1
I know next to nothing about hdfs, but I wonder if the following might work:
import pandas as pd
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
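For what it's worth, the pandas documentation describes read_csv as accepting either a path or any object with a read() method, so an open handle should be acceptable. A quick way to check this without a cluster is to pass it an in-memory file. In the sketch below, io.BytesIO is only a stand-in for the handle that hd.open returns, and the CSV contents are made up:

```python
import io

import pandas as pd

# Made-up CSV bytes; io.BytesIO stands in for the handle returned by hd.open().
fake_handle = io.BytesIO(b"id,name\n1,alice\n2,bob\n")

# read_csv only needs an object with a read() method, not a path on disk.
df = pd.read_csv(fake_handle)
print(df.shape)
```

If this pattern works with a plain buffer but not with the pydoop handle, the problem is more likely in how the handle behaves than in pandas itself.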
pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if it can read the file as a regular local file:
with open("/home/file.csv") as f:
    print(f.read())
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
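If passing the handle straight to read_csv fails, one fallback that might be worth trying is to read the whole file first and parse it from memory. This is only a sketch: read_csv_via_buffer is a hypothetical helper, it assumes the file fits in memory, and it is exercised here with the built-in open on a temporary file rather than with hd.open:

```python
import io
import os
import tempfile

import pandas as pd


def read_csv_via_buffer(opener, path, **kwargs):
    """Read the whole file through `opener`, then parse it from memory.

    `opener` is any callable returning a readable context manager,
    e.g. the built-in open (hd.open is assumed to behave the same way).
    """
    with opener(path) as f:
        data = f.read()
    # Wrap text in StringIO and bytes in BytesIO so read_csv gets a file-like object.
    if isinstance(data, str):
        buffer = io.StringIO(data)
    else:
        buffer = io.BytesIO(data)
    return pd.read_csv(buffer, **kwargs)


# Local demonstration with a made-up file; no HDFS involved.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "file.csv")
    with open(local_path, "w") as out:
        out.write("id,name\n1,alice\n2,bob\n")
    df = read_csv_via_buffer(open, local_path)
print(df)
```

With pydoop the call would presumably be read_csv_via_buffer(hd.open, "/home/file.csv"), but I have not been able to test that.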
Solution 2
You can use the following code to read a CSV file from HDFS:
import pandas as pd
import pyarrow as pa

hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user",
}
fs = pa.hdfs.connect(hdfs_config["host"], hdfs_config["port"],
                     user=hdfs_config["user"])
df = pd.read_csv(fs.open("/home/file.csv"))
lordingtar
Updated on July 20, 2022
Comments
lordingtar almost 2 years
I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print(f.read())
It shows me the file in stdout.
Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist