Spark iterate HDFS directory
Solution 1
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true) lists files recursively, returning a RemoteIterator.
And with Spark...
FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
Edit
It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme:
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
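listFiles returns a Hadoop RemoteIterator rather than a Scala collection, so it has to be drained by hand (this also answers the comment below about building a list from the RemoteIterator). A minimal sketch, assuming a spark-shell session where sc is in scope and using a hypothetical hdfs:///tmp path:

import org.apache.hadoop.fs.{LocatedFileStatus, Path}

val path = new Path("hdfs:///tmp") // hypothetical path; adjust to your cluster
val it = path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)

// RemoteIterator is not a Scala Iterator, so drain it into a buffer manually.
val files = scala.collection.mutable.ArrayBuffer.empty[LocatedFileStatus]
while (it.hasNext) files += it.next()

files.foreach(f => println(f.getPath))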
Solution 2
Here's the PySpark version, if anyone is interested:
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table. Other methods of the FileStatus object, like getLen() to get the file size, are described in the Hadoop FileStatus API documentation.
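For reference, a few of the commonly used FileStatus accessors, shown as a Scala sketch with a hypothetical /tmp path (the same methods are reachable from PySpark through the _jvm gateway above):

import org.apache.hadoop.fs.{FileSystem, Path}

val statuses = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path("/tmp"))
statuses.foreach { s =>
  println(s.getPath)             // fully qualified path
  println(s.getLen)              // file size in bytes
  println(s.isDirectory)         // true for directories
  println(s.getModificationTime) // last modified, epoch milliseconds
}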
Solution 3
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path("hdfs:///tmp"))
  .foreach(x => println(x.getPath))
This worked for me on Spark version 1.5.0-cdh5.5.2.
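Since the original question asks about iterating over directories specifically, the same call can be filtered on FileStatus.isDirectory. A small sketch along those lines, again with a hypothetical hdfs:///tmp path:

import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only the sub-directories of the given path.
val dirs = FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path("hdfs:///tmp"))
  .filter(_.isDirectory)
  .map(_.getPath)
dirs.foreach(println)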
Solution 4
This did the job for me (replace HAservice:9000 with your NameNode or HA nameservice address):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration)
  .listStatus(new Path("/tmp/"))
  .foreach(x => println(x.getPath))
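If the NameNode address differs between environments, wrapping the call in a small helper keeps things readable. A sketch, where listDir is a made-up name and hdfs://HAservice:9000 is the placeholder endpoint from the snippet above:

import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// List the immediate children of a directory on an explicit HDFS endpoint.
def listDir(nameNode: String, dir: String): Array[FileStatus] =
  FileSystem.get(new URI(nameNode), sc.hadoopConfiguration).listStatus(new Path(dir))

listDir("hdfs://HAservice:9000", "/tmp/").foreach(s => println(s.getPath))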
Solution 5
@Tagar's answer didn't show how to connect to a remote HDFS, but this does:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
for fileStatus in status:
    print(fileStatus.getPath())
Author: Jon
Updated on July 09, 2022

Comments
- Jon, almost 2 years ago: I have a directory of directories on HDFS, and I want to iterate over the directories. Is there an easy way to do this with Spark using the SparkContext object?
- JimLohse, about 8 years ago: Really nice! I had this question; granted, I guess this wouldn't work in the original spark-submit call.
- horatio1701d, over 6 years ago: How can I create a list of the files using the RemoteIterator this creates?
- Shrinivas Deshmukh, almost 6 years ago: An upvote for the PySpark version! Worked perfectly, thanks!
- ASH, almost 5 years ago: This worked fine for me, for a single folder. Is there some way to run it at the level of the parent folder and get all files in all subfolders? That would be very helpful for me.
- Denis Ka, over 2 years ago: Thank you! I was wondering how to do this in Python.