How to save data in HDFS with Spark?

12,700

The path has to be a directory in HDFS.

For example, to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/

When the Spark job runs, the directory myNewFolder will be created.
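As a minimal sketch of how such a path is put together, the following uses only `java.net.URI` from the Java standard library. The class name `HdfsPaths`, the helper `hdfsPath`, and the host `namenode.example.com` are hypothetical, chosen for illustration; port 9000 is the one used in the question. The resulting string is what would be passed to `saveAsTextFile`.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Spark or Hadoop.
final class HdfsPaths {

    // Builds hdfs://<host>:<port>/<folder>, where <folder> is a path in the
    // HDFS namespace (not a directory on the DataNode's local disk).
    static String hdfsPath(String nameNodeHost, int rpcPort, String folder) {
        String path = folder.startsWith("/") ? folder : "/" + folder;
        try {
            return new URI("hdfs", null, nameNodeHost, rpcPort, path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid HDFS location", e);
        }
    }

    public static void main(String[] args) {
        // Assumed host name and port, for illustration only.
        System.out.println(hdfsPath("namenode.example.com", 9000, "myNewFolder"));
        // prints hdfs://namenode.example.com:9000/myNewFolder
    }
}
```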

The DataNode data directory, set by dfs.datanode.data.dir in hdfs-site.xml, is where the DataNode stores the blocks of the files you put in HDFS. It is a directory on the local disk and should not be referenced as an HDFS directory path.

Author: Yassir S

Updated on June 29, 2022

Comments

  • Yassir S
    Yassir S almost 2 years

    I am using Spark Streaming to retrieve data from Kafka. Now I want to save my data in a remote HDFS. I know that I have to use the function saveAsTextFile, but I don't know precisely how to specify the path.

    Is it correct if I write this:

    myDStream.foreachRDD(frm->{
        frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
    });
    

    where ip_addr is the IP address of my remote HDFS server, /home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know whether I have to specify this directory), and myNewFolder is the folder where I want to save my data.

    Thanks in advance.

    Yassir

  • Yassir S
    Yassir S about 7 years
    OK, thanks, that is very clear. In your case, what would the value of 'port' be?
  • franklinsijo
    franklinsijo about 7 years
    You have used 9000 as the port, so use the same value here. This is the RPC port defined in core-site.xml for the property fs.defaultFS.
  • Yassir S
    Yassir S about 7 years
    I did what you told me to do, but I don't see any new file created and Spark does not return any error.
  • franklinsijo
    franklinsijo about 7 years
    Strange! Is the folder created? Also, please check that you have access to the remote HDFS.
  • franklinsijo
    franklinsijo about 7 years
    What was the issue?
  • Yassir S
    Yassir S about 7 years
    My new folder and files don't show up when I use Linux commands, but if I connect through a browser to master:50070 I can see them.
  • franklinsijo
    franklinsijo about 7 years
    These are HDFS files, so they are not visible to ordinary Linux commands. You should use HDFS-specific commands such as hdfs dfs -ls. Refer here
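The port discussed in the comments above is the one that appears in the fs.defaultFS value in core-site.xml. As a rough sketch, assuming that value has the form hdfs://host:port, the port can be read back with `java.net.URI`. The class name `DefaultFsPort`, the helper `rpcPort`, and the host `namenode.example.com` are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop.
final class DefaultFsPort {

    // Returns the port in an fs.defaultFS value such as hdfs://host:9000,
    // or -1 when the value does not specify a port.
    static int rpcPort(String defaultFs) {
        return URI.create(defaultFs).getPort();
    }

    public static void main(String[] args) {
        // Assumed fs.defaultFS value, for illustration only.
        System.out.println(rpcPort("hdfs://namenode.example.com:9000")); // prints 9000
    }
}
```

The same host and port are what the hdfs:// path passed to Spark should contain.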