How to check whether a file/folder is present using PySpark without getting an exception


Solution 1

Nice to see you on Stack Overflow.

I second dijksterhuis's solution, with one exception: AnalysisException is a very general exception in Spark and may be raised for various reasons, not only because of a missing file.

If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether it is S3, POSIX, or something else). The downside of this approach is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
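
For example, here is a minimal sketch of checking an S3 object directly with boto3, bypassing Spark entirely. It assumes boto3 is installed and credentials are already configured; the bucket and key names are hypothetical.

import boto3
from botocore.exceptions import ClientError

def s3_key_exists(bucket, key):
    # HEAD the object; a 404 means it does not exist, anything else is re-raised
    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

# hypothetical bucket/key, shown only to illustrate the call
print(s3_key_exists("my-bucket", "import/data/file.csv"))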

Solution 2

Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))

Solution 3

# sc is the SparkContext of an existing SparkSession; FileSystem.get(conf)
# returns the cluster's default filesystem (fs.defaultFS), e.g. HDFS or DBFS
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))

Solution 4

Looks like you should change except IOError: to except AnalysisException:.

Spark throws different errors/exceptions than regular Python in many cases. It's not doing typical Python I/O operations when reading a file, so it makes sense for it to throw a different exception.
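
A sketch of that change applied to the question's snippet, assuming spark is an existing SparkSession and using the built-in CSV reader rather than the external com.databricks.spark.csv package; in most PySpark versions AnalysisException can be imported from pyspark.sql.utils.

from pyspark.sql.utils import AnalysisException

try:
    df = spark.read.csv(
        "/FileStore/tables/HealthCareSample_dumm.csv",
        header=True, inferSchema=True,
    )
    print("File Exists")
except AnalysisException:
    print("file not found")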

Solution 5

The answer posted by @rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))

The function is the same, and it works fine to check whether a file exists in the S3 bucket path that you provide.

You will have to change this function based on how you are specifying the path value to it.

path = "s3://bucket-name/import/data/"
pathexists = path_exists(path)

If the path variable you define has the s3:// prefix, it will work.

Also, the portion of the code that splits the string gets you just the bucket name, as follows:

path.split("/")[2] will give you `bucket-name`
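
To make that concrete, splitting the example path from above shows why index 2 is the bucket name:

path = "s3://bucket-name/import/data/"
path.split("/")     # ['s3:', '', 'bucket-name', 'import', 'data', '']
path.split("/")[2]  # 'bucket-name'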

But if you don't have the s3:// prefix in the path, then you will have to use a slightly modified version of the function, as below:

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
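
With this variant, the same hypothetical bucket as above would be passed without the prefix:

path = "bucket-name/import/data/"
pathexists = path_exists(path)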
Updated on July 02, 2022

Comments

  • Admin
    Admin almost 2 years

    I am trying to check whether a file is present before reading it from PySpark in Databricks, to avoid exceptions. I tried the code snippet below, but I am getting an exception when the file is not present:

    from pyspark.sql import *
    from pyspark.conf import SparkConf
    SparkSession.builder.config(conf=SparkConf())
    try:
        df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter",",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
        print("File Exists")
    except IOError:
        print("file not found")`
    

    When I have the file, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"

  • D3V
    D3V about 5 years
    AnalysisException is thrown by Spark in many other situations as well, so even though this makes sense on the surface, it is better to check the reason why the exception occurred. So the solution proposed by @Prathik makes more sense.
  • Dror
    Dror almost 5 years
    With an S3 filesystem, this approach fails. Here's a solution for S3.
  • andresg3
    andresg3 over 3 years
    @dror do you know if there is a way to check if a path like s3://my-bucket-s3-test/lookuo*.csv exists?
  • Alex Ott
    Alex Ott about 3 years
    it will still throw an exception if the file doesn't exist
  • Nayan Sarkar
    Nayan Sarkar about 3 years
    Hi @AlexOtt, this function gives you a list of the files and folders in a given path. You have to be reasonably sure about the path up to a certain point, correct? Then you can look at which files and folders are in the subsystem and proceed accordingly.
  • Jacob Levinson
    Jacob Levinson over 2 years
    This worked for me to verify that a file or path existed on S3. However, I didn't need to split the path to create the URI. This is what worked for me: def path_exists(self, path, sc): # spark is a SparkSession fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get( sc._jvm.java.net.URI.create("s3://" + path), sc._jsc.hadoopConfiguration(), ) return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))