Remove Files from Directory after uploading in Databricks using dbutils

Solution 1

If you have a huge number of files, deleting them this way might take a lot of time. You can utilize Spark parallelism to delete the files in parallel. The answer I am providing is in Scala, but it can be changed to Python.

You can check whether the directory exists using the function below:

import java.io.FileNotFoundException

// dbutils.fs.ls throws FileNotFoundException when the path does not
// exist, so the try/catch doubles as an existence check.
def CheckPathExists(path: String): Boolean = {
  try {
    dbutils.fs.ls(path)
    true
  } catch {
    case _: FileNotFoundException => false
  }
}
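
For reference, a Python version of the same check might look like this (a sketch; in a Python notebook the missing-path error surfaces as a wrapped exception, so the usual pattern is to catch broadly):

def check_path_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False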

You can define a function to delete the files. You create this function inside an object and make the object extend Serializable, as below:

object Helper extends Serializable {
  // Lists the files under `directory`, turns the paths into a DataFrame,
  // and deletes each file in a separate Spark task.
  def delete(directory: String): Unit = {
    dbutils.fs.ls(directory).map(_.path).toDF.foreach { row =>
      val filePath = row(0).toString
      println(s"deleting file: $filePath")
      dbutils.fs.rm(filePath, true)
    }
  }
}
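
A direct Python port of this parallel delete is trickier: in Python notebooks dbutils is only available on the driver, so instead of deleting inside a Spark job you can get similar concurrency with a driver-side thread pool. A minimal sketch (delete_files_concurrently is a hypothetical helper name, not part of the answer above):

from concurrent.futures import ThreadPoolExecutor

def delete_files_concurrently(directory, max_workers=8):
    # List the files once on the driver, then delete them concurrently
    paths = [f.path for f in dbutils.fs.ls(directory)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path in paths:
            pool.submit(dbutils.fs.rm, path, True)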

Now you can first check whether the path exists; if it does, call the delete function to remove the files within the folder across multiple tasks.

val directoryPath = "<location>"
val directoryExists = CheckPathExists(directoryPath)
if (directoryExists) {
  Helper.delete(directoryPath)
}
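
Putting the two Python sketches above together (using the hypothetical names introduced there, with an example path from this thread):

directory_path = '/mnt/adls2/demo/target/'
if check_path_exists(directory_path):
    delete_files_concurrently(directory_path)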

Solution 2

If you want to delete all files from the following path: '/mnt/adls2/demo/target/', there is a simple command:

dbutils.fs.rm('/mnt/adls2/demo/target/', True)
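
Note that this removes the target folder itself along with its contents. As discussed in the comments below, one workaround is to recreate the folder afterwards (although, as also noted there, delete-and-recreate resets any permissions that were set on the folder):

dbutils.fs.rm('/mnt/adls2/demo/target/', True)   # deletes the folder and its contents
dbutils.fs.mkdirs('/mnt/adls2/demo/target/')     # recreate the now-empty folder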

Anyway, if you want to use your code, take a look at the dbutils documentation:

rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

The second argument of the function is expected to be a boolean, but your code passes a string with a path:

dbutils.fs.rm(files[i].path, '/mnt/adls2/demo/target/' + file)

So your new code can be as follows:

for i in range(0, len(files)):
    file = files[i].name
    if now in file:
        # files[i].path is already the full path to the file,
        # so there is no need to append the name again
        dbutils.fs.rm(files[i].path, True)
        print('removed     ' + file)
    else:
        print('not removed ' + file)

Solution 3

In order to remove files from DBFS, you can run this in any notebook:

%fs rm -r dbfs:/user/sample_data.parquet
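
The %fs magic is shorthand for dbutils.fs, so the equivalent call in a Python cell would be:

dbutils.fs.rm('dbfs:/user/sample_data.parquet', True)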

Comments

  • Carltonp
    Carltonp about 3 years

    A very clever person from StackOverflow assisted me in copying files to a directory from Databricks here: copyfiles

    I am using the same principle to remove the files once they have been copied, as shown in the link:

    for i in range (0, len(files)):
      file = files[i].name
      if now in file:  
        dbutils.fs.rm(files[i].path,'/mnt/adls2/demo/target/' + file)
        print ('copied     ' + file)
      else:
        print ('not copied ' + file)
    

    However, I'm getting the error:

    TypeError: '/mnt/adls2/demo/target/' has the wrong type - class bool is expected.

    Can someone let me know how to fix this? I thought it would be a simple matter of removing the file after originally copying it using the command dbutils.fs.rm

    • Carltonp
      Carltonp over 5 years
      ok, the above example didn't reflect the script we have in production, which is: for i in range (0, len(files)): file = files[i].name if now in file: dbutils.fs.rm(files[i].path,'adl://xxxxxxxxxxxx.azuredatalakestore.net/Folder Structure/RAW/1stParty/LCMS/DE/stageone/') print ('removed ' + file) else: print ('not removed ' + file) The problem was that I missed the open brackets. So the problem isn't the wrong type (class bool is expected) as stated above; the problem is an invalid syntax error at print ('removed ' + file). I hope that helps.
  • Carltonp
    Carltonp over 5 years
    wow @Fabio, I will test this out in the morning. If this works I won't understand how the Databricks experts (whom I have a support contract with) couldn't figure this out. Thanks in advance. I will let you know how I get on with it. Cheers
  • Carltonp
    Carltonp over 5 years
    the following command you suggested deletes the folder as well as the files: dbutils.fs.rm('/mnt/adls2/demo/target/', True). I just need the files deleted
  • Carltonp
    Carltonp over 5 years
    The actual command in our production is dbutils.fs.rm('adl://devszendsadlsrdpacqncd.azuredatalakestore.net/Folder Structure/RAW/1stParty/LCMS/DE/stageone', True)
  • Fabio Schultz
    Fabio Schultz over 5 years
    Hello @Carltonp, sorry I'm late to answer you. So I gather you can't delete the folder. I have a suggestion: you can use dbutils.fs.rm('/mnt/adls2/demo/target/', True) and after that you can create the folder again with dbutils.fs.mkdirs('/mnt/adls2/demo/target/', True) ... If it does not work for you, you can list all files and delete them one by one like you tried before
  • Carltonp
    Carltonp over 5 years
    'use dbutils.fs.rm('/mnt/adls2/demo/target/', True) so after that you can create a folder again': that's exactly what I did. Thank you soooo much. Also, just so you know, both of your suggestions worked. I hope you don't mind, but I have shared your solution with Databricks. Thanks man
  • Fabio Schultz
    Fabio Schultz over 5 years
    @Carltonp I'm glad to know that :)
  • gszecsenyi
    gszecsenyi about 4 years
    The problem with your solution (delete and create) is that it removes the permission settings on the data lake folders too. So the customer cannot access the data when special permissions were set on the folder.
  • Aaron Robeson
    Aaron Robeson over 3 years
    @FabioSchultz The mkdirs command above should be dbutils.fs.mkdirs('/mnt/adls2/demo/target/'); it only takes one argument.