Overriding default hadoop jars in class path

Solution 1

Assuming you're using Hadoop 0.20.203, this is handled in TaskRunner.java as follows:

  • The property you're looking for is on line 94 - mapreduce.user.classpath.first
  • Line 214 is where the call is made to build the list of classpaths, which delegates to a method called getClassPaths(..)
  • getClassPaths() is defined on line 524; there you can see that the configuration property decides whether your job and distributed cache libraries or the Hadoop libraries go on the classpath first
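
The ordering decision described in the last bullet can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the actual TaskRunner source: the class `ClasspathOrderSketch`, the method `buildClasspath`, and the jar names are all hypothetical, and only the effect of the flag on ordering is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the ordering choice; not Hadoop's TaskRunner code.
final class ClasspathOrderSketch {

    // userFirst stands in for the boolean value of mapreduce.user.classpath.first.
    static List<String> buildClasspath(List<String> userJars,
                                       List<String> hadoopJars,
                                       boolean userFirst) {
        List<String> classpath = new ArrayList<>();
        if (userFirst) {
            // Job and distributed cache jars win when a class exists in both sets.
            classpath.addAll(userJars);
            classpath.addAll(hadoopJars);
        } else {
            // Default behaviour: Hadoop's own jars come first.
            classpath.addAll(hadoopJars);
            classpath.addAll(userJars);
        }
        return classpath;
    }

    public static void main(String[] args) {
        // Placeholder jar names, chosen only for illustration.
        List<String> user = Arrays.asList("job.jar", "jackson-new.jar");
        List<String> hadoop = Arrays.asList("hadoop-core.jar", "jackson-old.jar");
        System.out.println(buildClasspath(user, hadoop, true));
        System.out.println(buildClasspath(user, hadoop, false));
    }
}
```

Because the JVM loads a class from the first classpath entry that contains it, putting the user jars first is what lets a newer library version shadow the one bundled with Hadoop.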

For other versions of Hadoop, check the TaskRunner.java class to confirm the name of the config property; after all, this is a "semi-hidden config":

static final String MAPREDUCE_USER_CLASSPATH_FIRST =
        "mapreduce.user.classpath.first"; //a semi-hidden config

Solution 2

In recent Hadoop versions (2.2+), you should set:

    conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);

Solution 3

These settings only work for referencing classes from external jars in your mapper or reducer tasks. If, however, you use those classes elsewhere, for example in a customized InputFormat, the class will fail to load. One way to make this work everywhere (in MR2) is to export this setting when submitting your job:

export HADOOP_USER_CLASSPATH_FIRST=true
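
If the job is launched from a Java program rather than from a shell, the same variable can be placed in the child process environment. The following is a minimal sketch under that assumption; the class `HadoopLauncherSketch`, the jar path, and the main class name are hypothetical placeholders, and it assumes a `hadoop` executable on the PATH.

```java
import java.util.Map;

// Hypothetical launcher sketch: sets HADOOP_USER_CLASSPATH_FIRST for a child
// "hadoop jar" process instead of exporting it in the shell.
final class HadoopLauncherSketch {

    // Builds (but does not start) the process; jarPath and mainClass are
    // placeholders supplied by the caller.
    static ProcessBuilder newJobProcess(String jarPath, String mainClass) {
        ProcessBuilder pb = new ProcessBuilder("hadoop", "jar", jarPath, mainClass);
        Map<String, String> env = pb.environment();
        env.put("HADOOP_USER_CLASSPATH_FIRST", "true");
        pb.inheritIO(); // show the job's output in this process's console
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = newJobProcess("my-job.jar", "com.example.MyJob");
        // Print what would be run; call pb.start() to actually submit the job.
        System.out.println(pb.command());
        System.out.println(pb.environment().get("HADOOP_USER_CLASSPATH_FIRST"));
    }
}
```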

Solution 4

I had the same issue, and the parameter that worked for me on Hadoop version 0.20.2-cdhu03 is "mapreduce.task.classpath.user.precedence".

This setting was tested and does not work on CDH3U3. The following answer is from the Cloudera team:

JobConf job = new JobConf(getConf(), MyJob.class);
job.setUserClassesTakesPrecedence(true); // put user classes ahead of Hadoop's

http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setUserClassesTakesPrecedence%28boolean%29
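
The property names reported in the answers above can be collected into a small lookup, which makes the version dependence explicit. This is only a sketch: the class `UserClasspathPropertySketch` and the version labels used as keys are hypothetical, and the table merely records what the answers state, including the caveat that the CDH name was reported not to work on CDH3U3. The 2.2+ entry is the string value of MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup of the "user classpath first" property name per version,
// as reported in the answers above; verify against your own Hadoop source.
final class UserClasspathPropertySketch {

    static final Map<String, String> PROPERTY_BY_VERSION = new LinkedHashMap<>();
    static {
        PROPERTY_BY_VERSION.put("0.20.203", "mapreduce.user.classpath.first");
        PROPERTY_BY_VERSION.put("2.2+", "mapreduce.job.user.classpath.first");
        // Reported to work on 0.20.2-cdhu03 but not on CDH3U3.
        PROPERTY_BY_VERSION.put("0.20.2-cdhu03", "mapreduce.task.classpath.user.precedence");
    }

    // Returns the property name for a known version label, or fails loudly.
    static String propertyFor(String versionLabel) {
        String name = PROPERTY_BY_VERSION.get(versionLabel);
        if (name == null) {
            throw new IllegalArgumentException("No property recorded for " + versionLabel);
        }
        return name;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : PROPERTY_BY_VERSION.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```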

Author: jayunit100

Current: Red Hat BigData, Apache BigTop committer. Past: PhD in scalable, data-driven bioinformatics analytics tools on the JVM, which led me into the world of big data as the genomic data space started to explode. After that, I was with peerindex as a Hadoop MapReduce dev, and now I'm a big data engineer at Red Hat. We're making red hat storage awesome(r). blog: http://jayunit100.blogspot.com. github: http://github.com/jayunit100 pubs : https://www.researchgate.net/profile/Jay_Vyas/publications/?ev=prf_pubs_p2

Updated on September 15, 2022

Comments

  • jayunit100 over 1 year

    I've seen several ways to give the user classpath precedence over the Hadoop one. This is often done when an M/R job needs a specific version of a library of which Hadoop already bundles an older version (for example Jackson's JSON parser, commons http, etc.).

    In any case, I've seen:

    mapreduce.task.classpath.user.precedence
    mapreduce.task.classpath.first
    mapreduce.job.user.classpath.first
    

    Which one of these parameters is the right one to set in my job configuration to force mappers and reducers to use a classpath that puts my user-defined hadoop_classpath jars BEFORE the Hadoop default dependency jars?

    By the way, this is related to this question: Dynamodb requestHandler acception, which I recently found is due to a jar conflict.

  • Nicholas White about 11 years
    Unfortunately this feature doesn't exist in 0.20.2...
  • Vasu over 9 years
    @Chris, interesting answer. But why is this hidden (or semi-hidden) at all? Any thoughts?