apache spark - check if file exists

62,711

Solution 1

For a file in HDFS, you can use the hadoop way of doing this:

val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))

Solution 2

For Pyspark, you can achieve this without invoking a subprocess using something like:

fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))

Solution 3

I will say, best way to call this through function which internally check for file presence in the traditional hadoop file check.

object OutputDirCheck {
  def dirExists(hdfsDirectory: String): Boolean = {
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
    fs.exists(new org.apache.hadoop.fs.Path(hdfsDirectory))
  }
}

Solution 4

Using dbutils:

def path_exists(path):
  try:
    if len(dbutils.fs.ls(path)) > 0:
      return True
  except:
    return False

Solution 5

For Java coders;

 SparkConf sparkConf = new SparkConf().setAppName("myClassname");
        SparkContext sparky = new SparkContext(sparkConf);       
        JavaSparkContext context = new JavaSparkContext(sparky);

     FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(context.hadoopConfiguration());
            Path path = new Path(sparkConf.get(path_to_File));

            if (!hdfs.exists(path)) {
                 //Path does not exist.
            } 
         else{
               //Path exist.
           }
Share:
62,711
Chandra
Author by

Chandra

Software developer

Updated on October 29, 2021

Comments

  • Chandra
    Chandra over 2 years

    I am new to spark and I have a question. I have a two step process in which the first step write a SUCCESS.txt file to a location on HDFS. My second step which is a spark job has to verify if that SUCCESS.txt file exists before it starts processing the data.

    I checked the spark API and didnt find any method which checks if a file exists. Any ideas how to handle this?

    The only method I found was sc.textFile(hdfs:///SUCCESS.txt).count() which would throw an exception when the file does not exist. I have to catch that exception and write my program accordingly. I didnt really like this approach. Hoping to find a better alternative.