Python write to hdfs file

12,116

Solution 1

Try the hdfs library (HdfsCLI); it's really good. You can use its write() method: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write

Example:

To create a connection:

from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')

from json import dump, dumps
records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing the serialized data directly. overwrite=True is needed here
# because the file was already created by the block above:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8', overwrite=True)

For CSV you can do:

import pandas as pd
df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
  df.to_csv(writer)
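
For the create/write/update case, write() also takes overwrite and append flags. Here is a minimal sketch, assuming the client object from the example above; the helper name save_text and its mode values are made up for illustration, and append is expected to work only if the file already exists and the cluster allows appends:

```python
def save_text(client, hdfs_path, text, mode="create"):
    """Write text to HDFS through an hdfs (HdfsCLI) client.

    mode: "create"  -> fail if the file already exists (library default)
          "replace" -> overwrite an existing file
          "append"  -> add to the end of an existing file
    """
    if mode not in ("create", "replace", "append"):
        raise ValueError(f"unknown mode: {mode!r}")
    # overwrite and append are never both True; the library rejects that combination.
    client.write(
        hdfs_path,
        data=text,
        encoding="utf-8",
        overwrite=(mode == "replace"),
        append=(mode == "append"),
    )
```

Called as save_text(client, 'data/notes.txt', 'second line\n', mode='append'), this adds a line to a file written earlier; mode='replace' rewrites it from scratch.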

Solution 2

What's wrong with other answers

They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox.

This is what the upload function of that hdfs library you linked to uses.
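
For context, a WebHDFS write is a plain HTTP PUT: the client asks the NameNode to CREATE the file, is redirected to a DataNode, and sends the bytes there. Below is a rough sketch of the URL for that first request, using only the standard library. The host, port, and user are placeholders. With simple (non-Kerberos) authentication the user.name parameter is taken at face value, which is the security concern:

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the NameNode URL for the first step of a WebHDFS file create."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,  # simple auth: the server trusts this value
        "overwrite": "true" if overwrite else "false",
    })
    # hdfs_path must be absolute; quote() keeps "/" and escapes the rest.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder NameNode address; the HTTP port differs between Hadoop versions.
print(webhdfs_create_url("namenode.example.com", 9870, "/data/records.jsonl", "ann"))
```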

Native (more secure) ways to write to HDFS using Python

You can use pyspark.

Example - How to write pyspark dataframe to HDFS and then how to read it back into dataframe?
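
A minimal PySpark sketch of that approach, assuming pyspark is installed and can reach the cluster. The helper name write_csv_to_hdfs, the namenode:8020 address, and the output path are placeholders. Note that Spark writes a directory of part files rather than a single file:

```python
def write_csv_to_hdfs(spark, rows, columns, target_uri):
    """Write rows (a list of tuples) to HDFS as CSV using a SparkSession."""
    df = spark.createDataFrame(rows, schema=columns)
    # mode("overwrite") replaces any existing output directory at target_uri.
    df.write.mode("overwrite").option("header", True).csv(target_uri)
    return df


# Usage (needs a Spark installation configured for the cluster):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("hdfs-write").getOrCreate()
#   uri = "hdfs://namenode:8020/user/ann/records_csv"
#   write_csv_to_hdfs(spark, [("foo", 1), ("bar", 2)], ["name", "weight"], uri)
#   spark.read.option("header", True).csv(uri).show()  # read it back
```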


snakebite has been mentioned, but it doesn't write files.


pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried.
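
An untested-against-a-cluster sketch of the pyarrow route. It uses the newer pyarrow.fs interface (HadoopFileSystem and open_output_stream) rather than the older FileSystem.open() mentioned above, and it assumes a local Hadoop install that provides libhdfs. The helper name write_bytes, the host, port, user, and path are placeholders:

```python
def write_bytes(filesystem, path, payload):
    """Write bytes to `path` on a pyarrow.fs filesystem (HDFS, local, ...)."""
    # The stream is flushed and closed when the with-block exits.
    with filesystem.open_output_stream(path) as stream:
        stream.write(payload)


# Usage (needs pyarrow plus Hadoop's native libhdfs on this machine):
#   from pyarrow import fs
#   hdfs = fs.HadoopFileSystem("namenode", 8020, user="ann")
#   write_bytes(hdfs, "/user/ann/records.jsonl", b'{"name": "foo"}\n')
```

Because the helper only depends on open_output_stream, it can be tried first against fs.LocalFileSystem() before pointing it at a cluster.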

Author by

nishant

Updated on July 26, 2022

Comments

  • nishant
    nishant almost 2 years

    What is the best way to create/write/update a file in remote HDFS from local python script?

    I am able to list files and directories but writing seems to be a problem.

    I have looked at hdfs and snakebite, but neither of them gives a clean way to do this.

  • OneCricketeer
    OneCricketeer about 4 years
    Might want to update Hadoop link with latest version
  • Itération 122442
    Itération 122442 over 3 years
    Why do you use the json library? What is the reason? What do I do if I have CSV rather than JSON?
  • Andy_101
    Andy_101 over 3 years
    I have added that to the answer.
  • Andy_101
    Andy_101 over 3 years
    That is true, but if they are writing to HDFS without Spark, there is no other very good option.