Python write to hdfs file
Solution 1
Try the hdfs library; it's really good. You can use write(): https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
Example:
To create a connection:
from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')
from json import dump, dumps
records = [
    {'name': 'foo', 'weight': 1},
    {'name': 'bar', 'weight': 2},
]
# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
    dump(records, writer)
# Or, passing the serialized data in directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')
For CSV you can do:
import pandas as pd
df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
Solution 2
What's wrong with other answers
They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox. This is what the upload function of the hdfs library you linked to uses.
Native (more secure) ways to write to HDFS using Python
You can use pyspark.
Example - How to write pyspark dataframe to HDFS and then how to read it back into dataframe?
snakebite has been mentioned, but it doesn't write files.
pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried it.
nishant
Updated on July 26, 2022
Comments
-
nishant almost 2 years
What is the best way to create/write/update a file in remote HDFS from local python script?
I am able to list files and directories but writing seems to be a problem.
I have searched hdfs and snakebite but none of them give a clean way to do this.
-
OneCricketeer about 4 years
Might want to update the Hadoop link with the latest version.
-
Itération 122442 over 3 years
Why do you use the json library? What is the reason? What do I do if I have CSV instead of JSON?
-
Andy_101 over 3 years
I have added that to the answer.
-
Andy_101 over 3 years
That is true, but if they are writing to HDFS without Spark there is no other very good option.