How to access HDFS via a nameservice URI with H/A namenodes from Spark when the Spark cluster is outside the Hadoop cluster?


Solution 1

Suppose your nameservice is 'hadooptest'. Then set the Hadoop configuration as shown below. You can get this information from the hdfs-site.xml file of the remote HA-enabled HDFS cluster.

sc.hadoopConfiguration.set("dfs.nameservices", "hadooptest")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hadooptest", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")

After this, you can use a URL with the 'hadooptest' nameservice, like below.

test.write.orc("hdfs://hadooptest/tmp/test/r1")

Check here for more information.
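
If you are writing the job in PySpark, the same Hadoop properties can be set on a running SparkContext through its underlying Java configuration object. This is only a minimal sketch, reusing the 'hadooptest' nameservice and example namenode addresses from above; the application name and the toy DataFrame are just placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ha-hdfs-example").getOrCreate()

# Reach the underlying Hadoop Configuration via the Java SparkContext.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("dfs.nameservices", "hadooptest")
hconf.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
hconf.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
hconf.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
hconf.set("dfs.client.failover.proxy.provider.hadooptest",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

# The 'hadooptest' nameservice now resolves to whichever namenode is active.
spark.range(10).write.mode("overwrite").orc("hdfs://hadooptest/tmp/test/r1")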

Solution 2

If you want to make an H/A HDFS cluster your default configuration (usually the case) so that it applies to every application started through spark-submit or spark-shell, you can write the cluster information into spark-defaults.conf.

sudo vim $SPARK_HOME/conf/spark-defaults.conf

And add the following lines, assuming your HDFS nameservice is hdfs-k8s:

spark.hadoop.dfs.nameservices   hdfs-k8s
spark.hadoop.dfs.ha.namenodes.hdfs-k8s  nn0,nn1
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0 192.168.23.55:8020
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1 192.168.23.56:8020
spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s    org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

It should take effect the next time an application is launched, for example:

sc.addPyFile('hdfs://hdfs-k8s/user/root/env.zip')
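
If you cannot edit spark-defaults.conf, the same spark.hadoop.* properties can also be supplied per application when the session is built (or as --conf options to spark-submit). A minimal PySpark sketch, reusing the example hdfs-k8s nameservice and addresses from above; the application name is a placeholder:

from pyspark.sql import SparkSession

# Every spark.hadoop.* entry is copied into the job's Hadoop Configuration.
spark = (SparkSession.builder
         .appName("ha-hdfs-per-app-config")
         .config("spark.hadoop.dfs.nameservices", "hdfs-k8s")
         .config("spark.hadoop.dfs.ha.namenodes.hdfs-k8s", "nn0,nn1")
         .config("spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0", "192.168.23.55:8020")
         .config("spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1", "192.168.23.56:8020")
         .config("spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
         .getOrCreate())

spark.sparkContext.addPyFile("hdfs://hdfs-k8s/user/root/env.zip")
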
Author: Jungtaek Lim

Collaborator (committer) of Jedis, committer and PMC member of Apache Storm, committer of Apache Spark, contributor of Apache Calcite, Apache Ambari, Apache Zeppelin, Redis, AsyncHBase, Fluentd, and so on. • http://github.com/heartsavior

Updated on June 09, 2022

Comments

  • Jungtaek Lim, almost 2 years ago

    I have some Spark applications which store output to HDFS.

    Since our Hadoop cluster uses namenode H/A and the Spark cluster is outside the Hadoop cluster (I know that's not ideal), I need to give the application an HDFS URI so that it can access HDFS.

    But it doesn't recognize the nameservice, so I can only give one namenode's URI, and when that namenode fails I have to modify the configuration file and try again.

    Accessing ZooKeeper to find out which namenode is active seems very annoying, so I'd like to avoid it.

    Could you suggest any alternatives?

    • Sandeep Singh, almost 9 years ago
      You can use the active NameNode URI to connect. It should look like this: hdfs://hostname:8020
    • Jungtaek Lim, almost 9 years ago
      Sorry, but I already did that. I want to know how to find the active namenode without checking manually.
    • Sandeep Singh, almost 9 years ago
      http://<namenode_hostname>:50070/dfshealth.jsp. Here you can see which namenode is in the active state.
    • Sandeep Singh, almost 9 years ago
      You can also use the command hadoop dfsadmin -report to get the status.
    • Jungtaek Lim, almost 9 years ago
      Thanks for the additional information, but I'm trying to avoid "manual" checking. The Spark application should find the active namenode automatically.
  • Bourne, over 5 years ago
    It should be 2 >> so it doesn't overwrite the whole file