HDFS access from remote host through Java API, user authentication
After some study I came to the following solution:
- I don't actually need a full Kerberos solution; currently it is enough that clients can run HDFS requests as any user. The environment itself is considered secure.
- This gives me a solution based on the Hadoop UserGroupInformation class. In the future I can extend it to support Kerberos.
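Conceptually, 'simple' authentication just picks a user name string and trusts it: an explicit override wins, otherwise the OS login name is used. The sketch below illustrates that selection logic only; it is not Hadoop's actual source, and the method name resolveUser is my own:

```java
// Illustrative sketch of how 'simple' (non-Kerberos) authentication resolves
// the effective user name. Not Hadoop's actual code, just the idea:
// an explicit override (e.g. HADOOP_USER_NAME or createRemoteUser) wins,
// otherwise the local OS account name is used.
class SimpleAuthSketch {

    static String resolveUser(String explicitOverride, String osUser) {
        if (explicitOverride != null && !explicitOverride.isEmpty()) {
            return explicitOverride; // the override wins, no password check at all
        }
        return osUser; // fall back to the local account name
    }

    public static void main(String[] args) {
        System.out.println(resolveUser("hbase", "alice")); // hbase
        System.out.println(resolveUser(null, "alice"));    // alice
    }
}
```

This is exactly why the trick works: nothing on the server verifies that you really are 'hbase'.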
Sample code, probably useful both for this 'fake authentication' and for remote HDFS access in general:
package org.myorg;

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsTest {

    public static void main(String[] args) {
        try {
            // Act as user 'hbase' (simple authentication, no Kerberos involved).
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hbase");

            ugi.doAs(new PrivilegedExceptionAction<Void>() {
                public Void run() throws Exception {
                    Configuration conf = new Configuration();
                    conf.set("fs.defaultFS", "hdfs://1.2.3.4:8020/user/hbase");
                    conf.set("hadoop.job.ugi", "hbase"); // legacy property, kept for older clients

                    FileSystem fs = FileSystem.get(conf);

                    // Write-access check: create a file as user 'hbase'.
                    fs.createNewFile(new Path("/user/hbase/test"));

                    // List the directory to confirm the file is there.
                    FileStatus[] status = fs.listStatus(new Path("/user/hbase"));
                    for (FileStatus s : status) {
                        System.out.println(s.getPath());
                    }
                    return null;
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
A useful reference for those who have a similar problem:
- The Cloudera blog post "Authorization and Authentication In Hadoop". Short and focused on a simple explanation of Hadoop security approaches. No information specific to a Java API solution, but good for a basic understanding of the problem.
UPDATE:
An alternative for those who use the command-line hdfs or hadoop utility and don't need a matching local user:

HADOOP_USER_NAME=hdfs hdfs dfs -put /root/MyHadoop/file1.txt /

What you actually do is read the local file according to your local permissions, but when placing the file on HDFS you are authenticated as the user hdfs.
This has properties quite similar to the API code illustrated above:
- You don't need sudo.
- You don't actually need an appropriate local user 'hdfs'.
- You don't need to copy anything or change permissions, because of the previous points.
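The same trick can be driven from Java by launching the utility as a child process with HADOOP_USER_NAME set in its environment. A minimal sketch; the helper name runWithHadoopUser is my own, and here it just echoes the variable back instead of invoking the real hdfs binary:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

// Hypothetical helper: run a command with HADOOP_USER_NAME set, the same way
// the shell one-liner above does it.
class HadoopUserEnvDemo {

    static String runWithHadoopUser(String user, String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("HADOOP_USER_NAME", user); // identity the HDFS client will report
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String out = r.lines().collect(Collectors.joining("\n"));
            p.waitFor();
            return out.trim();
        }
    }

    public static void main(String[] args) throws Exception {
        // Real use would be e.g.:
        // runWithHadoopUser("hdfs", "hdfs", "dfs", "-put", "/root/MyHadoop/file1.txt", "/");
        // Here we only prove the variable reaches the child process:
        System.out.println(runWithHadoopUser("hdfs", "sh", "-c", "echo $HADOOP_USER_NAME"));
    }
}
```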
Roman Nikitchenko
Updated on April 20, 2020

Comments
- Roman Nikitchenko, about 4 years ago: I need to use an HDFS cluster from a remote desktop through the Java API. Everything works OK until it comes to write access. If I try to create any file, I receive an access permission exception. The path looks good, but the exception indicates my remote desktop user name, which of course is not what I need to access the required HDFS directory.
The question is:
- Is there any way to represent a different user name using 'simple' authentication in the Java API?
- Could you please point to some good explanation of authentication / authorization schemes in Hadoop / HDFS, preferably with Java API examples?
Yes, I already know 'whoami' could be overridden in this case using a shell alias, but I prefer to avoid solutions like that. A further specific here is that I dislike tricks like pipes through SSH and scripts. I'd like to do everything using just the Java API. Thank you in advance.
- falconepl, over 10 years ago: I've stumbled upon the same problem as yours. I'm trying to send a Hadoop job from a remote client to the cluster that will execute it. In my case the problem is that Cloudera's Hadoop 2.0.0 (Hadoop 2.0.0-cdh4.3.1) doesn't provide the UserGroupInformation class that you've used. It seems the corresponding Apache Hadoop versions don't provide it either; there is just an enum named UserGroupInformation (link). How could it be done in such a case, in your opinion?
- Roman Nikitchenko, over 10 years ago: It's there, and it's not Cloudera-specific. I'm using the 2.0.0-cdh4.3.1 hadoop client right now.
- falconepl, over 10 years ago: What do you mean by saying it's there? I've checked the Apache Hadoop 2.0.6 API [link] as well as the 2.1.0 API [link] (those Javadocs that Apache provides on their website) and unfortunately there is no UserGroupInformation class, just the enum that doesn't help much. And by the way, isn't the 2.0.0-cdh4.3.1 Hadoop that you've mentioned a Cloudera Hadoop distribution?
- Roman Nikitchenko, over 10 years ago: The main point here is that CDH4 actually supports the 0.20 client, which is recommended. Just look here: blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0 and then here: cloudera.com/content/cloudera-content/cloudera-docs/…. As you can see, they recommend using the 0.20 client.
- falconepl, over 10 years ago: Ok, I see. If I got it right, CDH4.3 supports both the v0.20.2 (MapReduce) and v2.0.0 (MapReduce 2 - YARN) versions (link). The whole versioning thing is pretty obscure. But anyway, I still can't find either a Hadoop API Javadoc or a hadoop-core JAR for CDH4.3 in Cloudera's repository [link] that has a UserGroupInformation class.
- Roman Nikitchenko, over 10 years ago: Honestly, I always use the 1.0.4 documentation and it looks good enough. For really tough situations I just download the CDH4 Hadoop sources or Javadoc. For example, probably the most useful links for you: maven.tempo-db.com/artifactory/list/cloudera/org/apache/hadoop/… and maven.tempo-db.com/artifactory/list/cloudera/org/apache/hadoop/…
- Roman Kazanovskyi, over 6 years ago: If you execute it like java -jar myjar.jar, the file system will be LocalFileSystem. To get a DistributedFileSystem, execute your jar like hadoop jar myjar.jar or yarn jar myjar.jar.
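The reason behind that last comment: FileSystem.get() dispatches on the scheme of fs.defaultFS, and running with plain java -jar means the cluster configuration (which sets fs.defaultFS to an hdfs:// URI) is not on the classpath, so the default file:/// applies. A toy sketch of the scheme-to-implementation mapping; this is not Hadoop's actual dispatch code (Hadoop resolves implementations via fs.<scheme>.impl and service discovery), just the idea:

```java
// Illustrative sketch: which FileSystem implementation you get depends on
// the scheme of fs.defaultFS. With no cluster config on the classpath the
// default is file:///, hence LocalFileSystem.
class FsSchemeSketch {

    static String filesystemFor(String defaultFs) {
        if (defaultFs == null || defaultFs.startsWith("file:")) {
            return "LocalFileSystem";        // java -jar myjar.jar, no config
        }
        if (defaultFs.startsWith("hdfs:")) {
            return "DistributedFileSystem";  // hadoop jar / yarn jar with config
        }
        return "(other FileSystem implementation)";
    }

    public static void main(String[] args) {
        System.out.println(filesystemFor(null));                  // LocalFileSystem
        System.out.println(filesystemFor("hdfs://1.2.3.4:8020")); // DistributedFileSystem
    }
}
```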