apache spark - check if file exists
Solution 1
For a file in HDFS, you can check directly with the Hadoop FileSystem API:
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))
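For the asker's two-step pipeline, this check can gate the second job before any processing starts. A minimal sketch reusing fs from above (the data path is illustrative):
if (fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))) {
  // The first step finished successfully; safe to start processing.
  val data = sc.textFile("/path/on/hdfs/to/data")
  // ... process the data ...
} else {
  sys.error("SUCCESS.txt not found; aborting the second step.")
}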
Solution 2
For PySpark, you can achieve this without invoking a subprocess, using something like:
# Access the Hadoop FileSystem through Spark's JVM gateway; note that
# sc._jvm and sc._jsc are internal attributes and may change between versions.
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
Solution 3
I would say the best way is to wrap the check in a function that internally performs the traditional Hadoop file check:
object OutputDirCheck {
  def dirExists(hdfsDirectory: String): Boolean = {
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
    fs.exists(new org.apache.hadoop.fs.Path(hdfsDirectory))
  }
}
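Note that new Configuration() only picks up defaults from the classpath; to honour the running job's settings, a variant can take the active SparkContext's Hadoop configuration instead (a sketch; this overload is illustrative, not part of the answer above):
object OutputDirCheckWithContext {
  // Reuses the live job's Hadoop settings rather than classpath defaults.
  def dirExists(sc: org.apache.spark.SparkContext, hdfsDirectory: String): Boolean = {
    val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
    fs.exists(new org.apache.hadoop.fs.Path(hdfsDirectory))
  }
}
Usage: OutputDirCheckWithContext.dirExists(sc, "/path/on/hdfs/to/SUCCESS.txt")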
Solution 4
Using dbutils:
def path_exists(path):
    try:
        # dbutils.fs.ls raises an exception when the path does not exist;
        # a non-empty listing means the path is there.
        return len(dbutils.fs.ls(path)) > 0
    except Exception:
        return False
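On Databricks the same check works from Scala as well, since dbutils.fs.ls throws when the path is missing; a minimal sketch along the same lines:
def pathExists(path: String): Boolean =
  try {
    // A non-empty listing means the path exists and is visible.
    dbutils.fs.ls(path).nonEmpty
  } catch {
    case _: Exception => false
  }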
Solution 5
For Java coders:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("myClassname");
SparkContext sparky = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sparky);

FileSystem hdfs = FileSystem.get(context.hadoopConfiguration());
// path_to_File is a placeholder for the configuration key holding the file path.
Path path = new Path(sparkConf.get(path_to_File));

if (!hdfs.exists(path)) {
    // Path does not exist.
} else {
    // Path exists.
}
Comments
-
Chandra over 2 years
I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, which is a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing the data.
I checked the Spark API and didn't find any method that checks whether a file exists. Any ideas how to handle this?
The only method I found was sc.textFile("hdfs:///SUCCESS.txt").count(), which throws an exception when the file does not exist. I would have to catch that exception and write my program accordingly. I didn't really like this approach and am hoping to find a better alternative.