Scala & DataBricks: Getting a list of Files


You can map each entry returned by dbutils.fs.ls to its path and then filter the paths by your regex:

// Placeholder regex; replace it with your own. Leaving ??? here throws NotImplementedError.
val name: String = ".*file.*"
val all_files: Seq[String] = dbutils.fs.ls("s3://bucket").map(_.path).filter(_.matches(name))
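
A minimal sketch of the same idea that can run outside a notebook. dbutils exists only on Databricks, so the listing is stubbed here with a hypothetical FileInfo case class and made-up sample paths; on a cluster you would call dbutils.fs.ls("s3://bucket") instead. Note that String.matches succeeds only when the whole path matches the regex, which is why the placeholder pattern starts with .* here.

```scala
// Hypothetical stand-in for the objects returned by dbutils.fs.ls.
final case class FileInfo(path: String, name: String, size: Long)

object ListFilesSketch {
  // Made-up sample listing; on Databricks use dbutils.fs.ls("s3://bucket") instead.
  val listing: Seq[FileInfo] = Seq(
    FileInfo("s3://bucket/.internal_name.pl.swp", ".internal_name.pl.swp", 12288L),
    FileInfo("s3://bucket/file0", "file0", 10223616L),
    FileInfo("s3://bucket/file1", "file1", 0L)
  )

  // Placeholder regex (an assumption): the entire path must match it.
  val name: String = """.*/file\d+"""

  // Split the paths into those that match the regex and those that do not.
  val (matched, unmatched) = listing.map(_.path).partition(_.matches(name))

  // Iterate over the matching paths and print each one.
  def main(args: Array[String]): Unit = matched.foreach(println)
}
```

Using partition instead of filter keeps both halves of the split, which may be closer to "split by regex" than discarding the non-matching paths.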
Author: con

Education: Ph.D. Chemistry 2015. Current preferred languages are Perl and Raku, a.k.a. Perl6. Maintain my own program in GNU C that identifies differentially methylated regions in bisulfite sequencing data (https://github.com/hhg7/defiant). Will use Python or Ruby. Learning Apache Spark in Python and Scala.

Updated on June 28, 2022

Comments

  • con
    con almost 2 years

    I am trying to make a list of files in an S3 bucket on Databricks within Scala, and then split by regex. I am very new to Scala. The Python equivalent would be

    all_files = map(lambda x: x.path, dbutils.fs.ls(folder))
    filtered_files = filter(lambda name: True if pattern.match(name) else False, all_files)
    

    but I want to do this in Scala.

    From https://alvinalexander.com/scala/how-to-list-files-in-directory-filter-names-scala

    import java.io.File
    def getListOfFiles(dir: String):List[File] = {
        val d = new File(dir)
        if (d.exists && d.isDirectory) {
            d.listFiles.filter(_.isFile).toList
        } else {
            List[File]()
        }
    }
    

    However, this produces an empty list.

    I've also thought of

    var all_files: List[Any] = List(dbutils.fs.ls("s3://bucket"))
    

    but this produces a list of length 1 whose single element is the whole array:

    all_files: List[Any] = List(WrappedArray(FileInfo(s3://bucket/.internal_name.pl.swp, .internal_name.pl.swp, 12288), FileInfo(s3://bucket/file0, 10223616), FileInfo(s3://bucket/, file1, 0), ....)
    

    which has a length of 1. I cannot turn this into a DataFrame, as suggested by "How to iterate scala wrappedArray? (Spark)", so this isn't usable.

    How can I generate a list of files in Scala, and then iterate through them?

  • con
    con over 5 years
    Thank you! As an aside, what do you call this _. syntax? It seems to be a default input, similar to Perl's $_
  • Yauheni Leaniuk
    Yauheni Leaniuk almost 2 years
    NotImplementedError: an implementation is missing. I get this error with the name variable.
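
On the two follow-up comments: the underscore in _.path is Scala's placeholder syntax for the argument of an anonymous function, so _.path is shorthand for x => x.path. The ??? is a standard-library method that throws NotImplementedError when evaluated; it marks a value the reader is expected to fill in. A short sketch of both points, using made-up sample paths and a hypothetical object name:

```scala
object PlaceholderSketch {
  // Made-up sample paths, for illustration only.
  val paths: Seq[String] = Seq("s3://bucket/file0", "s3://bucket/file1")

  // These two lines are equivalent: _ stands for the lambda's single argument.
  val lengthsShort: Seq[Int] = paths.map(_.length)
  val lengthsLong: Seq[Int] = paths.map(x => x.length)

  // Evaluating ??? throws NotImplementedError, so it must be replaced with a
  // real value before the code can run. Here the error is caught to show its message.
  val missingMessage: String =
    try ???
    catch { case e: NotImplementedError => e.getMessage }
}
```

The message captured in missingMessage is the same "an implementation is missing" text reported in the last comment, which is what you see if the name variable is left as ???.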