How to Find duplicates files (by name and/or size) in a specific folder of my google drive?

10,989

For better readability I add the next part of this answer in a second answer.

Here is the code to handle the duplicates : it works in 2 steps :

  1. detect the duplicates and mark them in the sheet to check if everything is fine (markDuplicates())
  2. remove the red background on the files you want to preserve (they will become yellow after step 3 is done)
  3. delete the selected (marked) files, actually move them to trash, you'll have to empty the trash manually from the drive Ui. (trashDuplicates()) The deleted files will be colored in deep red to confirm deletion.

LAST EDIT :

Placed the full code here and made some functions private to avoid errors due to bad script calls.

function startProcess(){
  PropertiesService.getScriptProperties().deleteAllProperties();
  try{
    ScriptApp.deleteTrigger(ScriptApp.getProjectTriggers()[0]);
  }catch(e){}
  var sh = SpreadsheetApp.getActiveSheet();
  sh.getDataRange().clear();
  sh.getRange(1,1,1,4).setValues([['fileName (logged @'+Utilities.formatDate(new Date(),Session.getScriptTimeZone(),'MMM-dd-yyyy HH:mm')+')','fileSize','parent folders tree','fileID']]);
  var trig = ScriptApp.newTrigger('getDriveFilesList_').timeBased().everyMinutes(5).create();
  Logger.log(trig.getUniqueId()+'  '+trig.getHandlerFunction());
  getDriveFilesList_();
}

function getDriveFilesList_(){
  var content = [];
  var startTime = new Date().getTime();
  var sh = SpreadsheetApp.getActiveSheet();
  if( ! PropertiesService.getScriptProperties().getProperty('numberOfFiles')){
    PropertiesService.getScriptProperties().setProperty('numberOfFiles',0);
  }

  var numberOfFiles = Number(PropertiesService.getScriptProperties().getProperty('numberOfFiles'));
  Logger.log(numberOfFiles);
  var max = numberOfFiles+10000;
  if( ! PropertiesService.getScriptProperties().getProperty('continuationToken')){
    var files = DriveApp.getFiles();
   // var files = DriveApp.getFolderById('0B3qSFd_____MTFZMDQ').getFiles();// use this line and comment the above if you want to process a single folder
   // use your chozen folder ID of course (available from the browser url , the part after "https://drive.google.com/?authuser=0#folders/")
  }else{
    var files = DriveApp.continueFileIterator(PropertiesService.getScriptProperties().getProperty('continuationToken'))
    }
  while(files.hasNext() && numberOfFiles<(max)){
    var file = files.next()
    if(file.getSize()>0){
      numberOfFiles++;
      var folder = '(shared)';
      if(file.getParents().hasNext()){folder = getTree_(file)}
      content.push([file.getName(),file.getSize(),folder,file.getId()])
    }    
    if(new Date().getTime()-startTime > 250000){break};
  }
  sh.getRange(sh.getLastRow()+1,1,content.length,content[0].length).setValues(content);
  if(!files.hasNext()){ScriptApp.deleteTrigger(ScriptApp.getProjectTriggers()[0]);Logger.log('done !'); sh.getRange(sh.getLastRow()+1,1).setValue('All files processed ('+numberOfFiles+' found)')};
  var continuationToken = files.getContinuationToken()
  PropertiesService.getScriptProperties().setProperty('numberOfFiles',numberOfFiles);
  PropertiesService.getScriptProperties().setProperty('continuationToken',continuationToken);
}

function markDuplicates(){
  handleDuplicates_(false)
}

function trashDuplicates(){
  handleDuplicates_(true)
}

function handleDuplicates_(trash){  
  var sh = SpreadsheetApp.getActiveSheet();
  sh.setFrozenRows(1);
  sh.sort(1);
  var data = sh.getDataRange().getValues()
  var headers = data.shift()
  var lastComment = data.pop();
  var toDelete = [];
  var item = data[0];
  for(var n=1 ; n<data.length; n++){
    if(data[n][0]==item[0] && data[n][1]==item[1]){
      toDelete.push('delete '+n);
    }
    item=data[n];
  }
  var marker =   sh.getRange(2,1,data.length,1).getBackgrounds();
  for(var n in data){
    if(!trash){marker.push(['#FFF'])};
    if(toDelete.indexOf('delete '+n)>-1 && !trash){
      marker[n][0] = '#F99';
    }
    if(toDelete.indexOf('delete '+n)>-1 && trash){
      if(marker[n][0]=='#ff9999'){
        try{
        DriveApp.getFileById(data[n][3]).setTrashed(trash);
        marker[n][0] = '#F33';
        }catch(err){Logger.log(err)}
      }else{
        marker[n][0] = '#FF9';
      }
    }
  }
  sh.getRange(2,1,marker.length,1).setBackgrounds(marker);
}

function getTree_(file){
  var tree = [];
  var folderP = file.getParents()
  while (folderP.hasNext()){
    var folder = folderP.next();
    folderP = folder.getParents();
    tree.push(folder.getName());
  }
  return tree.reverse().join('/');
}
Share:
10,989

Related videos on Youtube

miodf
Author by

miodf

Updated on June 04, 2022

Comments

  • miodf
    miodf almost 2 years

    Is it possible to create a Google apps script which would select automatically duplicate files inside a specific (current) folder (then I just had to delete them with one click) ?

    https://developers.google.com/apps-script/reference/drive/

    I am asking this because Google drive pc client create random duplicates files when updating them ! They want to solve this bug but I have to manually select them in each folders and then delete them. It is very annoying and time consuming. https://productforums.google.com/forum/#!category-topic/drive/report-a-problem/_ApOopI-Xkw

    Edit: Alternatively, as windows cannot create the same filename inside the same folder, find and put in the trash all duplicate filenames inside a specific main folder (and its each of its related subfolders).

  • miodf
    miodf about 10 years
    Thanks. For : "I left aside the duplicate detection(...)" in my case I am uploading some folders of photos and videos (per year), so there are no google docs inside those folder. ;) Re: you "note", I am afraid I am just a too small beginner in GAS for doing that already. Anyway I will try your script right away.Thanks. ;)
  • miodf
    miodf about 10 years
    Thanks. No problem. ;) Here my test results: I create a new google spreadsheet. Go to Tools/Script Editor. Copy/paste your code. Save it. Run script. Authorize permission. Then wait a few seconds and go to View/logs to see the results: it displays: 100 filenames with their size. ;) I was not able to detect for now if files are found in a particular order or folder.
  • miodf
    miodf about 10 years
    Thanks for your second version: I run it and at the end of the Execution transcript I have : Execution failed: Cannot find method deleteTrigger((class)). (line 39, file "Code") [250.835 seconds total runtime]. It displays about 2500 rows in the spreadsheet (new version). Thanks in advance ;)
  • miodf
    miodf about 10 years
    Thanks for the edit : I have tried to add a folder only : var folder = 'PHOTOS'; instead of var folder = 'root'; but it didn't work.
  • Serge insas
    Serge insas about 10 years
    if you want to get only one folder I'd suggest to sort the files in the spreadsheet, that would be the easiest way... else you'd have to change var files = DriveApp.getFiles(); (line 23 in code) and replace driveApp by a folder object. (the var=(shared) is for display only)
  • Serge insas
    Serge insas about 10 years
    The error you get is because you didnt call the right function : you have to use startProcess and in case of failure you have to clear the trigger and scriptProperties manually.
  • Serge insas
    Serge insas about 10 years
    see next episode in answer #2 ;)
  • miodf
    miodf about 10 years
    Thanks. Yesterday, I had the error "Service using too much computer time for one day" for "Summary of failures for Google Apps Script: find_duplicates". I have changed the notifications settings in the current project trigger to immediate.
  • miodf
    miodf about 10 years
    Your code is working. For handleduplicate,I have about 15,000 rows in the spreadsheet. It is highlighting in red duplicate filenames that have the same size. ;) Would it be possible to add not only "parent folder" but also "parent folders" of "parent folder" ? Example: not just"2010" but "PHOTOS/2010". Thanks in advance ;)
  • miodf
    miodf about 10 years
    Another example: with html files. hpics.li/66439ae Row 1 = row 3 (what I find surprising is that even the file id is the same!). Same thing for row 2 = row 4.
  • Serge insas
    Serge insas about 10 years
    if you see the same id then it is not a duplicate but the very same file in 2 or more different folders... . In that case, you should not delete it. I guess the script should be tuned to allow selective delete depending on color in the sheet... what do you think ? As for the folder, it is of course doable to show the whole folder tree , there has been a few SO post on that subject, feel free to append some necessary code.
  • miodf
    miodf about 10 years
    Alas it is really a duplicate ! If I try a search with the filename, you can see that there are 2 files in the same folder here: hpics.li/6f5103f . Then, if I go to the folder, there is only one file shown (see top right hpics.li/dbe8a05 ) ! I did not run your function trashDuplicates just handleDuplicates so it is not in the trash either : hpics.li/8dd4546 note: I have only one folder with the same name : hpics.li/2ff3db9
  • miodf
    miodf about 10 years
    I think I may know why I have only 1 of the 2 files shown in the folder : I have removed this folder from my pc yesterday ! And Google drive client for pc has not already removed them online.
  • miodf
    miodf about 10 years
    Here is another better test with only one file I have created with "Fastone Capture" in my pc inside the folder "screenshots_me" located inside the Google Drive local PC folder. Google drive pc client created alone a duplicate online and locally on my pc! In my PC: it has appended it "(1).jpg" to the filename. And online it has kept the same filename ! Your duplicate spreadsheet shows 4 same filenames and size located in the same folder but with only 2 that have the same fileID hpics.li/0d0de03 If I do a search in my drive I have : hpics.li/7e64a6c (...)
  • miodf
    miodf about 10 years
    (...) If I look inside the online folder (see top right) : hpics.li/cb58df1 My local pc folder view hpics.li/7fb2e52 (I manually created only one file 2014-03-29_095332.jpg)
  • miodf
    miodf about 10 years
    Imho, I should choose at the beginning a specific folder like "PHOTOS". As it has a far smaller number of files (and contains also the biggest files sizes) than my other folders in drive, I won't have again, like yesterday and today, the error "Service using too much computer time for one day" (i.e.: one hour of computer time in my case). Per your suggestion, I have changed var files = DriveApp.getFiles(); with var files = getFoldersByName('PHOTOS').getFiles(); alas it didn't work as I expected ! Thanks in advance ;)
  • Serge insas
    Serge insas about 10 years
    Should be var files = DriveApp.getFoldersByName('PHOTOS').getFiles();
  • miodf
    miodf about 10 years
    Thanks. ;) But it didn't work either on my part ! I also tried unsucessfully var folderToSearch = "PHOTOS"; with var files = DriveApp.getFoldersByName(folderToSearch).getFiles(); Both list files inside other folders ! My PHOTO folder is located closer to the root "My Drive". Thanks in advance ;)
  • Serge insas
    Serge insas about 10 years
    last version : I added a folder tree display and the ability to select which file to delete or not.
  • miodf
    miodf about 10 years
    Many thanks. When I run the function getTree I have this error : TypeError: Cannot call method "getParents" of undefined. (line 90, file "Code"). Thanks in advance ;)
  • Serge insas
    Serge insas about 10 years
    You didn't read my instructions : only 2 functions to run in this part : markDuplicate and trashDuplicates. The other are called from within the script, don't use them directly since they need parameters.In the first part use only startProcess. That said, I begin to think that you could accept this answer... don't you ?