pyspark : how to check if a file exists in hdfs


Right, as Tristan Reid says:

...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
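As a small illustration of the glob support mentioned in that quote, textFile accepts Hadoop glob patterns directly, so several directories can be read in one call. This is a minimal sketch; the path is made up for the example:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Read all gzipped logs for months 01-06 into one RDD. The pattern is a
# Hadoop glob, expanded cluster-side; the path itself is hypothetical.
rdd = sc.textFile("hdfs:///logs/2022/0[1-6]/*.gz")
```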

For reference, here is his answer to a related question: Pyspark: get list of files/directories on HDFS path

Once you have the list of files in a directory, it is easy to check whether a particular file exists.
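For example, here is a minimal sketch of listing a directory from pyspark itself (the directory and file name below are assumptions, not from the original answer). It reaches the Hadoop FileSystem API through Spark's JVM gateway, so no external hadoop process is spawned:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hadoop classes exposed on the JVM side via Py4J.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

# List everything under a directory (path is hypothetical)...
statuses = fs.listStatus(hadoop.fs.Path("/user/data"))
names = {status.getPath().getName() for status in statuses}

# ...then checking for a particular file is a cheap membership test.
print("part-00000.csv" in names)
```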

I hope this helps.


Comments

  • A7med, almost 2 years ago:

    I want to check whether several files exist in HDFS before loading them with SparkContext. I am using pyspark. I tried os.system("hadoop fs -test -e %s" % path), but since I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and then filtering by keys, but that crashed too because parent_path contains many sub-paths and files. Could you help me?
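A hedged sketch for the many-paths situation described in this comment (the paths list and file names are hypothetical): reuse a single FileSystem handle and call exists() once per path, instead of launching a hadoop fs subprocess for every file:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
jvm = sc._jvm

# One FileSystem handle for all checks; no subprocess per path.
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

paths = ["/data/a.parquet", "/data/b.parquet"]  # hypothetical paths
valid = [p for p in paths if fs.exists(jvm.org.apache.hadoop.fs.Path(p))]

# Load only the files that actually exist; textFile accepts a
# comma-separated list of paths.
if valid:
    rdd = sc.textFile(",".join(valid))
```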