HDFS access from remote host through Java API, user authentication

java security authentication hadoop hdfs

27,132

After some studying I came to the following solution:

I don't actually need the full Kerberos solution; for now it is enough that clients can run HDFS requests as any user. The environment itself is considered secure.
This leads to a solution based on Hadoop's UserGroupInformation class. In the future I can extend it to support Kerberos.

Sample code, probably useful both for 'fake authentication' and for remote HDFS access:

package org.myorg;

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;

public class HdfsTest {

    public static void main(String[] args) {

        try {
            // Act as the remote user "hbase" (simple authentication, no Kerberos).
            UserGroupInformation ugi
                = UserGroupInformation.createRemoteUser("hbase");

            // Everything inside run() is executed on behalf of that user.
            ugi.doAs(new PrivilegedExceptionAction<Void>() {

                @Override
                public Void run() throws Exception {

                    Configuration conf = new Configuration();
                    conf.set("fs.defaultFS", "hdfs://1.2.3.4:8020/user/hbase");
                    conf.set("hadoop.job.ugi", "hbase");

                    FileSystem fs = FileSystem.get(conf);

                    // Write access: create an empty file as user "hbase".
                    fs.createNewFile(new Path("/user/hbase/test"));

                    // Read access: list the directory contents.
                    for (FileStatus status : fs.listStatus(new Path("/user/hbase"))) {
                        System.out.println(status.getPath());
                    }
                    return null;
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Useful reference for those who have a similar problem:

Cloudera blog post "Authorization and Authentication In Hadoop". Short and focused on a simple explanation of Hadoop security approaches. It has no information specific to the Java API solution, but it is good for a basic understanding of the problem.

UPDATE:
An alternative for those who use the command-line hdfs or hadoop utility, with no local user needed:

 HADOOP_USER_NAME=hdfs hdfs dfs -put /root/MyHadoop/file1.txt /

What actually happens is that the local file is read according to your local permissions, but when the file is placed on HDFS you are authenticated as user hdfs.

This has properties very similar to the API code above:

You don't need sudo.
You don't need an actual local user 'hdfs'.
You don't need to copy anything or change permissions, because of the previous points.
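
The way the effective user name is picked in this case can be pictured with a small stand-alone model. This is only a sketch, not Hadoop code: SimpleAuthUserModel and resolve are hypothetical names, and the lookup order (the HADOOP_USER_NAME environment variable, then the Java system property of the same name, then the local OS user) is an assumption about how 'simple' authentication behaves, so check the UserGroupInformation source of your Hadoop version before relying on it.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-alone model, NOT part of the Hadoop API.
class SimpleAuthUserModel {

    static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";

    // Assumed lookup order under 'simple' authentication:
    // 1. environment variable, 2. JVM system property, 3. local OS user.
    static String resolve(Map<String, String> env, Properties props, String osUser) {
        String user = env.get(HADOOP_USER_NAME);
        if (user == null) {
            user = props.getProperty(HADOOP_USER_NAME);
        }
        if (user == null) {
            user = osUser;
        }
        return user;
    }

    public static void main(String[] args) {
        // Feed the model with the real environment of this JVM.
        String effective = resolve(System.getenv(), System.getProperties(),
                System.getProperty("user.name"));
        System.out.println("Effective HDFS user (model): " + effective);
    }
}
```

Under this model, the command-line form above takes the first branch, while a plain run without the variable falls through to the local OS user, which matches the user name reported in the permission exception.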

Author by

Roman Nikitchenko

Big data and Telecom domain expert. Current focus is on data storage and processing solutions based on Hadoop / HBase infrastructure. Lot of experience as software engineer / technical leader with areas such as networking, VoIP, PWE, Ethernet, SONET, ATM, TDM, multithreading / multitasking / distributed programming. OS: Linux, VxWorks, FreeBSD, RTXC. Broad technology range starting from lot of experience with embedded devices and up to distributed enterprise systems based on Java technologies.

Updated on April 20, 2020

Comments

Roman Nikitchenko about 4 years

I need to use an HDFS cluster from a remote desktop through the Java API. Everything works OK until it comes to write access. If I try to create any file, I receive an access permission exception. The path looks good, but the exception reports my remote desktop user name, which of course is not the user I need to access the HDFS directory.

The question is:
- Is there any way to present a different user name using 'simple' authentication in the Java API?
- Could you please point to a good explanation of authentication / authorization schemes in Hadoop / HDFS, preferably with Java API examples?

Yes, I already know 'whoami' could be overridden in this case using a shell alias, but I prefer to avoid solutions like this. I also dislike tricks like pipes through SSH and scripts. I'd like to do everything using just the Java API. Thank you in advance.
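
The failure described in the question can be illustrated with a simplified owner / group / other write check. This is a rough sketch only: WriteCheckModel and canWrite are hypothetical names, the directory owner, group, and mode are assumed values, and the model ignores things a real cluster also considers (superuser, ACLs, execute permission on parent directories).

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Collections;
import java.util.Set;

// Hypothetical model of a POSIX-style write check, NOT Hadoop code.
class WriteCheckModel {

    // Returns true if 'user' may write to a directory with the given
    // owner, group and mode string such as "rwxr-xr-x".
    static boolean canWrite(String user, Set<String> userGroups,
                            String owner, String group, String mode) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(mode);
        if (user.equals(owner)) {
            return perms.contains(PosixFilePermission.OWNER_WRITE);
        }
        if (userGroups.contains(group)) {
            return perms.contains(PosixFilePermission.GROUP_WRITE);
        }
        return perms.contains(PosixFilePermission.OTHERS_WRITE);
    }

    public static void main(String[] args) {
        // Assumed values: /user/hbase owned by hbase:hbase with mode rwxr-xr-x.
        Set<String> noGroups = Collections.emptySet();
        // A desktop user name falls into "other" and has no write bit.
        System.out.println(canWrite("desktopuser", noGroups, "hbase", "hbase", "rwxr-xr-x")); // false
        // The owner has the write bit.
        System.out.println(canWrite("hbase", noGroups, "hbase", "hbase", "rwxr-xr-x"));       // true
    }
}
```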
falconepl over 10 years

I've stumbled upon the same problem as you. I'm trying to send a Hadoop job from a remote client to the cluster that will execute it. In my case the problem is that Cloudera's Hadoop 2.0.0 (Hadoop 2.0.0-cdh4.3.1) doesn't provide the UserGroupInformation class that you've used. It seems that the corresponding Apache Hadoop versions don't provide it either. There is just an enum named UserGroupInformation - link. How could it be done in such a case, in your opinion?
Roman Nikitchenko over 10 years

It's there, it's just not cloudera. I'm using the 2.0.0-cdh4.3.1 hadoop client right now.
falconepl over 10 years

What do you mean by saying it's there? I've checked the Apache Hadoop 2.0.6 API [link] as well as the 2.1.0 API [link] (the Javadocs that Apache provides on their website) and unfortunately there is no UserGroupInformation class, just the enum, which doesn't help much. And by the way, isn't the 2.0.0-cdh4.3.1 Hadoop that you've mentioned Cloudera's Hadoop distribution?
Roman Nikitchenko over 10 years

The main point here is that CDH4 actually supports the 0.20 client, which is recommended. Just look here: blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0 and then here: cloudera.com/content/cloudera-content/cloudera-docs/…. As you can see, they recommend using the 0.20 client.
falconepl over 10 years

Ok, I see. If I got it right, CDH4.3 supports both v0.20.2 (MapReduce) and v2.0.0 (MapReduce 2 - YARN) versions - link. The whole versioning thing is pretty obscure. But anyway, I still can find neither the Hadoop API Javadoc nor a hadoop-core JAR for CDH4.3 in Cloudera's repository [link] that has a UserGroupInformation class.
Roman Nikitchenko over 10 years

Honestly, I always use the 1.0.4 documentation and it looks good enough. For really tough situations I just download the CDH4 hadoop sources or javadoc. For example, probably the most needed things for you: maven.tempo-db.com/artifactory/list/cloudera/org/apache/hadoop/… and maven.tempo-db.com/artifactory/list/cloudera/org/apache/hadoop/…
Roman Kazanovskyi over 6 years

If you execute it like: java -jar myjar.jar, the file system will be LocalFileSystem. To get DistributedFileSystem, execute your jar like: hadoop jar myjar.jar or yarn jar myjar.jar
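
One possible reading of this comment: the file system implementation follows the scheme of fs.defaultFS, and when neither the cluster configuration on the classpath nor the code sets that property, the built-in default applies. The stand-alone sketch below only models that choice. DefaultFsSchemeModel and schemeFor are hypothetical names, and the fallback value file:/// is assumed from Hadoop's core-default.xml.

```java
import java.net.URI;

// Hypothetical stand-alone model, NOT part of the Hadoop API.
class DefaultFsSchemeModel {

    // Assumed built-in default when fs.defaultFS is not configured.
    static final String BUILT_IN_DEFAULT = "file:///";

    // Returns the URI scheme that would select the file system implementation.
    static String schemeFor(String configuredDefaultFs) {
        String value = (configuredDefaultFs == null) ? BUILT_IN_DEFAULT : configuredDefaultFs;
        return URI.create(value).getScheme();
    }

    public static void main(String[] args) {
        // Nothing configured: the local file system scheme is selected.
        System.out.println(schemeFor(null));
        // Value used in the sample code of the answer: the hdfs scheme is selected.
        System.out.println(schemeFor("hdfs://1.2.3.4:8020/user/hbase"));
    }
}
```

Printing fs.getUri() in the real program is a quick way to see which case applies on a given setup.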