Recursively process zip archives to extract files while discarding files of a specific format


Modifying the answer found here, this PowerShell script should do what you want. Save it as a file with the extension ".ps1" and run it as ./filename.ps1. It extracts each archive into its own folder, deletes the zip file, and then removes all extracted files with the .pdf extension. I have not tested it with recursive paths; it should work, but please test it.

Edit: If you don't want your zip files to be deleted, remove or comment out (#) the line rmdir -Path $_.FullName -Force

Requirements: PowerShell and 7-Zip. You also need to set the path to 7z.exe in the script (it is written as C:\Program Files\7-Zip\7z.exe).

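# Root folder to scan; override with: ./filename.ps1 -folderPath "D:\data"
param([string]$folderPath="D:\Blah\files")

Get-ChildItem $folderPath -recurse | %{ 

    # Process only items whose name ends in .zip
    if($_.Name -match '\.zip$')
    {
        $parent="$(Split-Path $_.FullName -Parent)";    
        write-host "Extracting $($_.FullName) to $parent\$($_.BaseName)"

        # "e" extracts all files, without folder structure, into <parent>\<archive name>
        $arguments=@("e", "`"$($_.FullName)`"", "-o`"$($parent)\$($_.BaseName)`"");
        $ex = start-process -FilePath "`"C:\Program Files\7-Zip\7z.exe`"" -ArgumentList $arguments -wait -PassThru;

        # 7-Zip exits with code 0 when extraction succeeded
        if( $ex.ExitCode -eq 0)
        {
            write-host "Extraction successful, deleting $($_.FullName)"
            rmdir -Path $_.FullName -Force
            # Remove the PDFs that were just extracted
            $arguments1="$($parent)\$($_.BaseName)\*.pdf"
            rmdir -Recurse -Path $arguments1
        }
    }
}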
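
The question asks that PDFs never be extracted, rather than extracted and then deleted. 7-Zip's command line has an exclude switch (-x) that can do this, and its "x" command keeps the folder structure stored in the archive. A minimal sketch of an alternative argument list follows; the helper name Get-SevenZipArguments and its parameters are hypothetical, and the snippet only builds the arguments without calling 7z.exe.

```powershell
# Hypothetical helper (not part of the original script): builds the
# 7-Zip argument list for extracting one archive while skipping PDFs.
function Get-SevenZipArguments {
    param(
        [Parameter(Mandatory = $true)][string]$ArchivePath,
        [Parameter(Mandatory = $true)][string]$OutputDir,
        [string[]]$ExcludePatterns = @('*.pdf')
    )

    # "x" extracts with full paths; "-y" answers any prompt with yes.
    $arguments = @('x', "`"$ArchivePath`"", "-o`"$OutputDir`"", '-y')

    # "-xr!<pattern>" excludes matching files at any depth inside the archive.
    foreach ($pattern in $ExcludePatterns) {
        $arguments += "-xr!$pattern"
    }
    return $arguments
}

# Example paths taken from the question (D:\data\one.zip -> D:\extracted\one).
$sevenZipArgs = Get-SevenZipArguments -ArchivePath 'D:\data\one.zip' -OutputDir 'D:\extracted\one'
$sevenZipArgs -join ' '
```

The returned array can be passed to start-process through -ArgumentList in place of $arguments. With PDFs excluded at extraction time, the later *.pdf cleanup step should no longer be needed.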


Author: Fr0zenFyr

Updated on September 18, 2022

Comments

  • Fr0zenFyr
    Fr0zenFyr over 1 year

    UPDATE: I noticed that many people are viewing this thread, which makes me believe that this situation is not so rare after all. I asked a similar/related question on SO here; it has some pretty decent solutions that might solve the problem in a better way.

    On my Windows 7 machine, I have a directory full of downloaded dumps in ZIP archives. Each archive contains a few text files, PDFs and, rarely, XML files. I want to extract all the contents of each ZIP archive into its own folder (which must be created during the process) while skipping the PDFs. After the required files have been extracted from an archive, the processed zip must not be deleted (or I would like to know how I can control this in different situations).

    If it helps to know, the number of archives in the directory is in the range of 60k-70k. Also, I need separate output directories because files in one archive may have the same names as files in another.

    For example,

    • I have all my archives like one.zip, two.zip,.. in, say, D:\data
    • I create a new folder for processed data, say, D:\extracted
    • Now the data from D:\data\one.zip should go to D:\extracted\one. Here, D:\extracted\one should be created automatically.
    • During this whole extraction process, PDFs should be ignored, not extracted. There's no point in extracting and then deleting them.
    • (Optional) A log file should be maintained at, say, D:\extracted. Idea is to use this file to resume processing from where it was left in case of an error.
    • (Optional) Script should let me decide whether I want to keep source archives or delete them after processing.

    I already searched for a solution but couldn't find one. I came across a few questions like these

    1. Recursively unzip files where they reside, then delete the archives
    2. 7 zip extract recursively
    3. Is it possible to recursively list zip file contents with 7 zip without extracting

    but they were not of much help (I'm not a pro with Windows, by the way). I'm open to installing safe, ad-free, open-source 3rd party software like 7-zip.

    EDIT: Is there a readily available tool that does what I need? I already tried Multi Unpacker. It doesn't create new directories and it can't ignore *.pdf files. It's also slow to start; I think it first reads all the archives at the source before starting to process them.

    Thanks in advance!

    • private_meta
      private_meta almost 10 years
      I don't see any way around this without a batch or PowerShell script. As far as I know, there is no out-of-the-box solution for something like this.
    • Fr0zenFyr
      Fr0zenFyr almost 10 years
      @private_meta thanks for your response. I had already guessed as much, but it's good to be sure. Can you point me in the right direction for writing a PowerShell script for this? I also understand that ignoring PDFs during extraction is a huge challenge, so I'm ready to let the script extract everything and then delete the PDFs.
  • Fr0zenFyr
    Fr0zenFyr almost 10 years
    I was thinking of asking you to help me modify the code from the same answer; you are a mind reader. I will try this code and report the progress here. I'm really glad you took the time to read my question carefully and covered almost every aspect of it.
  • private_meta
    private_meta almost 10 years
    Also, if you use more than one "param", you need to call them like this: "./script.ps1 -folderPath path -delete" and so on. For switches, refer to this
  • Fr0zenFyr
    Fr0zenFyr almost 10 years
    Thanks mate, I tip my hat to you. This script achieved almost everything that I wanted (except the log file). Since there has been no better answer than this, I accept your answer as the solution. Oh, and BTW, by default my system's PowerShell didn't allow me to run the script, saying that running scripts is disabled. I had two choices: either sign the script or execute set-ExecutionPolicy Unrestricted in PowerShell as Administrator. I tried both and they worked. The first is the better choice, but explaining why is out of this comment's scope.
  • Fr0zenFyr
    Fr0zenFyr almost 10 years
    Hi again, the script worked beautifully except in one case I found. A few of my zip files had subfolders; the script extracted each folder but placed its contents next to it (outside the sub-dir). Can this be fixed somehow? Also, a few of my files had .tar and .zip archives inside them, so what should I replace the .zip -match condition in the if statement with to process them recursively? Thanks in advance.
  • private_meta
    private_meta almost 10 years
    If you replace $arguments=@("e", with $arguments=@("x", it should preserve the directory structure; please test that. As for recursive extraction, I don't know if it works properly like that, but you can have the script call itself with a new directory, in this case every subdirectory. If there is a zip file in the root of that folder, it will unpack it. Otherwise, it gets a lot more complicated, and I'm not good enough with PowerShell for that.
  • Fr0zenFyr
    Fr0zenFyr almost 10 years
    I've started to dislike PowerShell; it seems confusing and complicated. I'm trying to manage this with a batch script now, and I already did much of it in just 1 line. Thanks for the reply though, mate. I just posted a question on SO; you can see my progress there.
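
Two optional items from the question stay open in the thread: the log file for resuming an interrupted run, and choosing at call time whether source archives are deleted. Here is a minimal sketch of both. It is not part of the accepted script: the function names, the -Delete switch and the log format (one processed archive path per line) are assumptions made for illustration.

```powershell
# Hypothetical helper: returns the archives that are not yet listed in the log.
# Assumed log format: one fully processed archive path per line.
function Get-PendingArchives {
    param(
        [string[]]$ArchivePaths,
        [string]$LogPath
    )

    # A hashtable gives fast, case-insensitive lookups, which helps with 60k+ archives.
    $done = @{}
    if (Test-Path -LiteralPath $LogPath) {
        foreach ($line in Get-Content -LiteralPath $LogPath) {
            if ($line) { $done[$line] = $true }
        }
    }
    return @($ArchivePaths | Where-Object { -not $done.ContainsKey($_) })
}

# Hypothetical helper: records one archive as processed and optionally deletes it.
function Complete-Archive {
    param(
        [string]$ArchivePath,
        [string]$LogPath,
        [switch]$Delete   # off unless -Delete is passed on the command line
    )

    # Log first, so a failed delete does not cause the archive to be extracted again.
    Add-Content -LiteralPath $LogPath -Value $ArchivePath
    if ($Delete -and (Test-Path -LiteralPath $ArchivePath)) {
        Remove-Item -LiteralPath $ArchivePath -Force
    }
}
```

In the script, Complete-Archive would be called only after 7-Zip returns exit code 0, and Get-PendingArchives would filter the list of zip files before the loop starts. Adding a [switch]$delete to the script's own param block would let it be invoked in the style private_meta describes, with -delete passed to remove archives and omitted to keep them.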