pyspark : how to check if a file exists in hdfs
Right, as Tristan Reid says:
...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check whether a particular file exists.
I hope it can help somehow.
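A route not spelled out in the answer above, so treat it as a sketch: PySpark exposes the JVM, so you can call the Hadoop `FileSystem` API directly instead of spawning a shell per path. The helper names (`make_hdfs_exists`, `filter_existing`) are my own, and the paths in the usage comment are made up.

```python
def make_hdfs_exists(sc):
    """Build an exists(path) predicate backed by the JVM Hadoop
    FileSystem API (assumes an active SparkContext `sc`)."""
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return lambda p: fs.exists(jvm.org.apache.hadoop.fs.Path(p))

def filter_existing(paths, exists):
    """Keep only the paths the predicate reports as present:
    one in-process RPC per path instead of one shell process."""
    return [p for p in paths if exists(p)]

# Usage on a cluster (hypothetical paths):
#   exists = make_hdfs_exists(sc)
#   good = filter_existing(["/data/a.csv", "/data/b.csv"], exists)
#   rdd = sc.textFile(",".join(good))
```

Because `filter_existing` takes the predicate as an argument, you can reuse it with any existence check, not just HDFS.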
A7med
Updated on June 13, 2022

Comments
A7med, almost 2 years ago
I want to check whether several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried

os.system("hadoop fs -test -e %s" % path)

but as I have a lot of paths to check, the job crashed. I also tried

sc.wholeTextFiles(parent_path)

and then filtering by keys, but that crashed too, because parent_path contains a lot of sub-paths and files. Could you help me?
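Following the answer's suggestion of listing the directory first: one listing turns many existence checks into cheap set lookups, instead of one `hadoop fs -test -e` process per path. `check_many` is an illustrative name, the paths are made up, and obtaining `listed` is left to the linked answer.

```python
def check_many(wanted, listed):
    """Map each wanted path to True/False using a single
    directory listing rather than one shell call per path."""
    listed_set = set(listed)
    return {p: (p in listed_set) for p in wanted}

# Example with made-up paths:
listed = ["/data/2022/a.csv", "/data/2022/b.csv"]
print(check_many(["/data/2022/a.csv", "/data/2022/c.csv"], listed))
# {'/data/2022/a.csv': True, '/data/2022/c.csv': False}
```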