How to save data in HDFS with Spark?
The path has to be a directory in HDFS. For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/. On execution of the Spark job, this directory myNewFolder will be created.
The datanode data directory given for dfs.datanode.data.dir in hdfs-site.xml is where the blocks of the files you store in HDFS are kept; it should not be referenced as an HDFS directory path.
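As a quick sketch of how such a save path breaks down (the host namenode-host, port 9000, and folder /myNewFolder are placeholder values, not taken from any real cluster), plain Java's URI class shows each component. Note that no local disk path such as the datanode data directory appears anywhere in it:

```java
import java.net.URI;

// Decomposes a sample HDFS save path of the form used with saveAsTextFile.
// All values here are placeholders for illustration only.
public class HdfsPathCheck {
    public static void main(String[] args) {
        URI out = URI.create("hdfs://namenode-host:9000/myNewFolder");
        System.out.println(out.getScheme()); // hdfs          -> filesystem scheme
        System.out.println(out.getHost());   // namenode-host -> the NameNode, not a DataNode
        System.out.println(out.getPort());   // 9000          -> RPC port from fs.defaultFS
        System.out.println(out.getPath());   // /myNewFolder  -> a path inside HDFS, not on local disk
    }
}
```

The key point is that the path component is resolved by the NameNode inside the HDFS namespace; the local directories where DataNodes keep blocks never appear in it.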
Yassir S
Updated on June 29, 2022

Comments
-
Yassir S, almost 2 years ago:
I want to use Spark Streaming to retrieve data from Kafka. Now I want to save my data in a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is that correct if I write this:
myDStream.foreachRDD(frm->{ frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder"); });
where ip_addr is the IP address of my remote HDFS server, /home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know whether I have to specify this directory), and myNewFolder is the folder where I want to save my data. Thanks in advance.
Yassir
-
Yassir S, about 7 years ago: OK, thanks, it is very clear. In your case, what would be the value of the 'port'?
-
franklinsijo, about 7 years ago: You have used 9000 as the port, so the same has to be used. This is the RPC port defined in core-site.xml for the property fs.defaultFS.
-
Yassir S, about 7 years ago: I did what you told me to do, but I don't see any new file created and Spark does not return any error.
-
franklinsijo, about 7 years ago: Strange! Is the folder created? Also, please check that you have access to the remote HDFS.
-
franklinsijo, about 7 years ago: What was the issue?
-
Yassir S, about 7 years ago: My new folder and files don't appear with Linux commands, but if I connect through a browser to master:50070 I can see them.
-
franklinsijo, about 7 years ago: These are HDFS files; you should use HDFS-specific commands. Refer here.
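Following up on the port discussion in the comments: the RPC port in the save URI comes from the fs.defaultFS property in core-site.xml. A minimal sketch of that entry, with a placeholder host name and port (your cluster's values will differ), would look like:

```xml
<configuration>
  <!-- fs.defaultFS gives the NameNode RPC endpoint; its port is the one
       to use in paths like hdfs://namenode-host:9000/myNewFolder -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

With this in place, files Spark writes under hdfs://namenode-host:9000/myNewFolder live in the HDFS namespace, so they are listed with HDFS shell commands such as hdfs dfs -ls /myNewFolder, not with plain Linux ls, which only sees the local filesystem.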