How to check whether a file/folder is present using PySpark without getting an exception
Solution 1
Nice to see you on StackOverflow.
I second dijksterhuis's solution, with one exception: AnalysisException is a very general exception in Spark and can be raised for various reasons, not only a missing file.
If you want to check whether the file exists, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether it is S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
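For example, here is a minimal sketch of such a direct check against S3 using boto3 (the bucket and key below are placeholders, and this assumes boto3 is installed and S3 credentials are available to the cluster):

import boto3

def s3_key_exists(bucket, prefix):
    # List at most one object under the prefix; an empty result means nothing exists there
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0

# Hypothetical usage; bucket and key are placeholders
if s3_key_exists("my-bucket", "import/data/file.csv"):
    df = spark.read.csv("s3://my-bucket/import/data/file.csv")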
Solution 2
Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add `sc._jvm.java.net.URI.create("s3://" + path.split("/")[2])`; here `s3` is the prefix of your cluster's file system.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
Solution 3
# sc is the SparkContext of an active SparkSession; this uses the cluster's default filesystem
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
Solution 4
Looks like you should change `except IOError:` to `except AnalysisException:`.
Spark throws different errors/exceptions than regular Python in a lot of cases. It's not doing typical Python I/O operations when reading a file, so it makes sense for it to throw a different exception.
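As a minimal sketch of that change, applied to the read from the question (assuming an active SparkSession named spark; in most PySpark versions AnalysisException can be imported from pyspark.sql.utils):

from pyspark.sql.utils import AnalysisException

try:
    df = (spark.read
          .option("delimiter", ",")
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/FileStore/tables/HealthCareSample_dumm.csv"))
    print("File Exists")
except AnalysisException:
    # Spark raises AnalysisException, not IOError, when the path does not exist
    print("file not found")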
Solution 5
The answer posted by @rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is the same, and it works fine for checking whether a file exists in the S3 bucket path that you provide.
You will have to adjust this function based on how you specify the path value you pass to it.
path = "s3://bucket-name/import/data/"
pathexists = path_exists(path)
If the path variable you define includes the s3 prefix, then it will work.
Also, the portion of the code that splits the string gets you just the bucket name, as follows:
path.split("/")[2] will give you `bucket-name`
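To see why index 2 is the bucket name, here is how the split behaves on the example path above:

path = "s3://bucket-name/import/data/"
print(path.split("/"))      # ['s3:', '', 'bucket-name', 'import', 'data', '']
print(path.split("/")[2])   # bucket-name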
But if you don't have the s3 prefix in the path, then you will have to use the function with some code changed, as below:
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
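A hypothetical usage of this variant, with the scheme omitted from the path (the bucket name is a placeholder, as above):

path = "bucket-name/import/data/"
pathexists = path_exists(path)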

Admin
Updated on July 02, 2022
Comments
-
Admin 11 months
I am trying to check whether a file is present before reading it from PySpark in Databricks, to avoid exceptions. I tried the below code snippet, but I am getting an exception when the file is not present:
from pyspark.sql import *
from pyspark.conf import SparkConf

SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", ",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")
When the file is present, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'".
-
D3V about 4 years
AnalysisException is thrown regularly by Spark in many other situations, so even though it makes sense on the surface, it is better to check the reason why this exception occurred. So the solution proposed by @Prathik makes more sense.
-
Dror almost 4 years
With an S3 filesystem, this approach fails. Here's a solution for S3.
-
andresg3 over 2 years
@dror do you know if there is a way to check if a path like `s3://my-bucket-s3-test/lookuo*.csv` exists?
-
Alex Ott about 2 years
It will still throw an exception if the file doesn't exist.
-
Nayan Sarkar about 2 years
Hi @AlexOtt, this function gives you a list of files and folders in a given path. You have to be sure about the path up to a certain point, correct? Then you can look at what files and folders are in the subsystem and proceed accordingly.
-
Jacob Levinson over 1 year
This worked for me to verify that a file or path existed on S3. However, I didn't need to split the path to create the URI. This is what worked for me:
def path_exists(self, path, sc):
    # spark is a SparkSession
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))