How to save data in HDFS with Spark?
The path has to be a directory in HDFS. For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/. On execution of the Spark job, this directory myNewFolder will be created.
The datanode data directory given for dfs.datanode.data.dir in hdfs-site.xml is where the blocks of the files you store in HDFS are kept; it should not be referenced as an HDFS directory path.
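As a quick sketch of how such a save path breaks down (the host namenode-host, port 9000, and folder /myNewFolder are placeholder values, not taken from any real cluster), plain Java's URI class shows each component. Note that no local disk path such as the datanode data directory appears anywhere in it:

```java
import java.net.URI;

// Decomposes a sample HDFS save path of the form used with saveAsTextFile.
// All values here are placeholders for illustration only.
public class HdfsPathCheck {
    public static void main(String[] args) {
        URI out = URI.create("hdfs://namenode-host:9000/myNewFolder");
        System.out.println(out.getScheme()); // hdfs          -> filesystem scheme
        System.out.println(out.getHost());   // namenode-host -> the NameNode, not a DataNode
        System.out.println(out.getPort());   // 9000          -> RPC port from fs.defaultFS
        System.out.println(out.getPath());   // /myNewFolder  -> a path inside HDFS, not on local disk
    }
}
```

The key point is that the path component is resolved by the NameNode inside the HDFS namespace; the local directories where DataNodes keep blocks never appear in it.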
Yassir S
Updated on June 29, 2022

Comments
-
Yassir S, almost 2 years ago:
I want to use Spark Streaming to retrieve data from Kafka. Now I want to save my data in a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is that correct if I write this:
myDStream.foreachRDD(frm->{ frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder"); });
where ip_addr is the IP address of my remote HDFS server, /home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know whether I have to specify this directory), and myNewFolder is the folder where I want to save my data. Thanks in advance.
Yassir
-
Yassir S, about 7 years ago: OK, thanks, it is very clear. In your case, what would be the value of the 'port'?
-
franklinsijo, about 7 years ago: You have used 9000 as the port, so the same has to be used. This is the RPC port defined in core-site.xml for the property fs.defaultFS.
-
Yassir S, about 7 years ago: I did what you told me to do, but I don't see any new file created and Spark does not return any error.
-
franklinsijo, about 7 years ago: Strange! Is the folder created? Also, please check that you have access to the remote HDFS.
-
franklinsijo, about 7 years ago: What was the issue?
-
Yassir S, about 7 years ago: My new folder and files don't appear with Linux commands, but if I connect through a browser to master:50070 I can see them.
-
franklinsijo, about 7 years ago: These are HDFS files; you should use HDFS-specific commands. Refer here.
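Following up on the port discussion in the comments: the RPC port in the save URI comes from the fs.defaultFS property in core-site.xml. A minimal sketch of that entry, with a placeholder host name and port (your cluster's values will differ), would look like:

```xml
<configuration>
  <!-- fs.defaultFS gives the NameNode RPC endpoint; its port is the one
       to use in paths like hdfs://namenode-host:9000/myNewFolder -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

With this in place, files Spark writes under hdfs://namenode-host:9000/myNewFolder live in the HDFS namespace, so they are listed with HDFS shell commands such as hdfs dfs -ls /myNewFolder, not with plain Linux ls, which only sees the local filesystem.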