How to use Powershell to list duplicate files in a folder structure that exist in one of the folders


Solution 1

1

I stared at this for a while, determined to write it without studying the existing answers, but I'd already glanced at the first sentence of Matt's answer mentioning Group-Object. After trying some different approaches, I ended up with basically the same answer, except that his is long-form and robust, with regex character escaping and setup variables, while mine is terse because you asked for shorter answers and because that's more fun.

$inc = '^c:\\s\\includes'
$cs = (gci -R 'c:\s' -File -I *.cs) | group name
$nopes = $cs |?{($_.Group.FullName -notmatch $inc)-and($_.Group.FullName -match $inc)}
$nopes | % {$_.Name; $_.Group.FullName}

Example output:

someFile.cs
c:\s\includes\wherever\someFile.cs
c:\s\lib\factories\alt\someFile.cs
c:\s\contrib\users\aa\testing\someFile.cs

The concept is:

  1. Get all the .cs files in the whole source tree
  2. Split them into groups of {filename: {files which share this filename}}
  3. For each group, keep only those whose set of files contains at least one path that matches the includes folder and at least one path that does not. This step covers
    1. duplicates (a file that exists only once cannot pass both tests)
    2. only duplicates that cross the includes/not-includes divide, rather than files duplicated within one branch
    3. triplicates, n-tuplicates, and so on.

Edit: I added the ^ to $inc to say it has to match at the start of the string, so the regex engine can fail faster for paths that don't match. Maybe this counts as premature optimization.
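
Spelled out long-form with comments, the same pipeline looks roughly like this (an untested sketch of the identical logic; the variable names $includesPattern, $groups and $clashes are just for illustration):

# Regex: path begins with c:\s\includes (backslashes escaped for the regex engine).
$includesPattern = '^c:\\s\\includes'

# 1. Get every .cs file in the whole source tree, grouped by file name.
$groups = Get-ChildItem -Path 'c:\s' -Recurse -File -Include '*.cs' |
    Group-Object -Property Name

# 2. Keep only the groups that have at least one copy inside the includes folder
#    AND at least one copy outside it.
$clashes = $groups | Where-Object {
    ($_.Group.FullName -match    $includesPattern) -and
    ($_.Group.FullName -notmatch $includesPattern)
}

# 3. Print each clashing file name followed by all of its paths.
$clashes | ForEach-Object {
    $_.Name
    $_.Group.FullName
}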


2

After that pretty dense attempt, the shape of a cleaner answer is much much easier:

  1. Get all the files, split them into include, not-include arrays.
  2. Nested for-loop testing every file against every other file.

Longer, but enormously quicker to write (it runs slower, though) and I imagine easier to read for someone who doesn't know what it does.

$sourceTree = 'c:\\s'    # doubled backslash so the same literal also works as a regex pattern below

$allFiles = Get-ChildItem $sourceTree -Include '*.cs' -File -Recurse

$includeFiles = $allFiles | where FullName -imatch "$($sourceTree)\\includes"
$otherFiles = $allFiles | where FullName -inotmatch "$($sourceTree)\\includes"

foreach ($incFile in $includeFiles) {
    foreach ($oFile in $otherFiles) {
        if ($incFile.Name -ieq $oFile.Name) {
            write "$($incFile.Name) clash"
            write "* $($incFile.FullName)"
            write "* $($oFile.FullName)"
            write "`n"
        }
    }
}

3

Because code-golf is fun. If the hashtables are faster, what about this even less tested one-liner...

$h=@{};gci c:\s -R -file -Filt *.cs|%{$h[$_.Name]+=@($_.FullName)};$h.Values|?{$_.Count-gt1-and$_-like'c:\s\includes*'}

Edit, explaining this version: it takes much the same approach as version 1, but the grouping operation happens explicitly in the hashtable. The shape of the hashtable becomes:

$h = @{
    'fileA.cs' = @('c:\s\wherever\fileA.cs', 'c:\s\includes\fileA.cs')
    'file2.cs' = @('c:\s\somewhere\file2.cs')
    'file3.cs' = @('c:\s\includes\file3.cs', 'c:\s\x\file3.cs', 'c:\s\z\file3.cs')
}

It hits the disk once for all the .cs files and iterates the whole list once to build the hashtable. I don't think it can do less work than that for this part.

It uses +=, so it can add files to the existing array for that filename; otherwise each assignment would overwrite the hashtable entry and every list would be one item long, holding only the most recently seen file.

It uses @() because when it hits a filename for the first time, $h[$_.Name] returns nothing, and the script needs to put an array into the hashtable, not a string. If it were +=$_.FullName, the first file would go in as a string and the next += would do string concatenation, which is no use here. Wrapping every file in @(..) forces the first entry to start an array. That's the least-code way to get the result, but churning out a throwaway array for every single file is needless work. Maybe changing it to longer code which does less array creation would help?

Changing the section

%{$h[$_.Name]+=@($_.FullName)}

to something like

%{if (!$h.ContainsKey($_.Name)){$h[$_.Name]=@()};$h[$_.Name]+=$_.FullName}

(I'm guessing, I don't have much intuition for what's most likely to be slow PowerShell code, and haven't tested).
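
As an aside, the difference @() makes can be demonstrated in isolation (a quick hypothetical snippet, not part of the one-liner):

# Without @(): the first += stores a plain string, the second += concatenates strings.
$bad = @{}
$bad['a.cs'] += 'c:\s\x\a.cs'
$bad['a.cs'] += 'c:\s\includes\a.cs'
$bad['a.cs']          # 'c:\s\x\a.csc:\s\includes\a.cs' - one mangled string

# With @(): the first += starts an array, the second += appends to it.
$good = @{}
$good['a.cs'] += @('c:\s\x\a.cs')
$good['a.cs'] += @('c:\s\includes\a.cs')
$good['a.cs'].Count   # 2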

After that, using $h.Values isn't going over every file a second time; it's going over every array in the hashtable - one per unique filename. That's got to happen to check the array size and prune the non-duplicates, but the -and operation short-circuits: when the Count -gt 1 test fails, the bit on the right checking the path name doesn't run.

If the array has two or more files in it, the -and $_ -like ... part executes and pattern-matches to see whether at least one of the duplicates is in the includes path. (Bug: if all the duplicates are in c:\s\includes and none are anywhere else, it will still show them.)
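
A possible fix for that bug (an untested sketch) is to also require at least one copy outside the includes path, which is the same test version 1 makes:

# Keep a group only if it has copies both inside AND outside c:\s\includes.
$h.Values | Where-Object {
    $_.Count -gt 1 -and
    ($_ -like    'c:\s\includes*') -and
    ($_ -notlike 'c:\s\includes*')
}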

--

4

This is version 3, edited to use the hashtable-initialization tweak; it now keeps track of repeated filenames in $s, and at the end only considers the names it has seen more than once.

$h=@{};$s=@{};gci 'c:\s' -R -file -Filt *.cs|%{if($h.ContainsKey($_.Name)){$s[$_.Name]=1}else{$h[$_.Name]=@()}$h[$_.Name]+=$_.FullName};$s.Keys|%{if ($h[$_]-like 'c:\s\includes*'){$h[$_]}}

Assuming it works, that's what it does, anyway.
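
Unrolled for readability, version 4 is roughly the following (the same logic as the one-liner, just reformatted with comments):

$h = @{}   # filename -> array of full paths
$s = @{}   # filenames seen more than once (hashtable used as a set)

Get-ChildItem 'c:\s' -Recurse -File -Filter *.cs | ForEach-Object {
    if ($h.ContainsKey($_.Name)) {
        $s[$_.Name] = 1          # second or later sighting: flag as duplicate
    } else {
        $h[$_.Name] = @()        # first sighting: start an empty array
    }
    $h[$_.Name] += $_.FullName
}

# For each duplicated name, print the group if any copy is under the includes path.
$s.Keys | ForEach-Object {
    if ($h[$_] -like 'c:\s\includes*') { $h[$_] }
}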

-- Edit, branching off the topic: I keep thinking there ought to be a way to do this with the things in the System.Data namespace. Does anyone know whether you can connect System.Data.DataTable().ReadXML() to gci | ConvertTo-Xml without reams of boilerplate?

Solution 2

I'd do more or less the same, except I'd build the hashtable from the contents of the includes folder and then run over everything else to check for duplicates:

$root     = 'C:\s'
$includes = "$root\includes"

# Map each file name under the includes folder to the directory it lives in.
$includeList = @{}
Get-ChildItem -Path $includes -Filter '*.cs' -Recurse -File |
  % { $includeList[$_.Name] = $_.DirectoryName }

# Walk the rest of the tree and report every file whose name appears in that map.
Get-ChildItem -Path $root -Filter '*.cs' -Recurse -File |
  ? { $_.FullName -notlike "$includes\*" -and $includeList.Contains($_.Name) } |
  % { "Duplicate of '{0}': {1}" -f $includeList[$_.Name], $_.FullName }

Solution 3

I'm not as impressed with this as I would like to be, but I thought Group-Object might have a place in this question, so I present the following:

$base = 'C:\s'
$unique = "$base\includes"
$extension = "*.cs"

Get-ChildItem -Path $base -Filter $extension -Recurse |
    Group-Object Name |
    Where-Object { ($_.Count -gt 1) -and (($_.Group).FullName -match [regex]::Escape($unique)) } |
    ForEach-Object {
        $filename = $_.Name
        ($_.Group).FullName -notmatch [regex]::Escape($unique) | ForEach-Object {
            "'{0}' has file with same name as '{1}'" -f (Split-Path $_), $filename
        }
    }

Collect all the files matching the extension filter $extension. Group the files by name. Then, of those groups, find every group that contains more than one file and has at least one member inside the directory $unique. Take those groups and print out all the files that are not from the unique directory.
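
For reference, the objects coming out of the Group-Object step look roughly like this when displayed (illustrative, using the example names from Solution 1; uniqueFile.cs and its path are made up):

Count Name          Group
----- ----          -----
    3 someFile.cs   {c:\s\includes\wherever\someFile.cs, c:\s\lib\factories\alt\someFile.cs, c:\s\contrib\users\aa\testing\someFile.cs}
    1 uniqueFile.cs {c:\s\app\uniqueFile.cs}

Only the first row survives the Where-Object filter: its Count is greater than 1 and at least one member sits under $unique.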

From Comment

For what it's worth, this is what I used for testing, to create a bunch of files. (I know folder 9 is empty; Get-Random -Maximum 9 never returns 9.)

$base = "E:\Temp\dev\cs"
Remove-Item "$base\*" -Recurse -Force
0..9 | %{[void](New-Item -ItemType directory "$base\$_")}
1..1000 | %{
    $number = Get-Random -Minimum 1 -Maximum 100
    $folder = Get-Random -Minimum 0 -Maximum 9
    [void](New-Item -Path $base\$folder -ItemType File -Name "$number.txt" -Force)
}

Solution 4

After looking at all the others, I thought I would try a different approach.

$includes = "C:\s\includes"
$root = "C:\s"

# First script
Measure-Command {
    [string[]]$filter = ls $includes -Filter *.cs -Recurse | % name
    ls $root -include $filter -Recurse -Filter *.cs | 
        Where-object{$_.FullName -notlike "$includes*"}
}

# Second Script
Measure-Command {
    $filter2 = ls $includes -Filter *.cs -Recurse 
    ls $root -Recurse -Filter *.cs | 
        Where-object{$filter2.name -eq $_.name -and $_.FullName -notlike "$includes*"}
}

In my first script, I get all the include file names into a string array. Then I use that string array as an -Include parameter on Get-ChildItem. At the end, I filter the include folder out of the results.

In my second script, I enumerate everything and then filter after the pipe.

Remove the Measure-Command wrappers to see the results; I was using them to check the speed. With my dataset, the first one was 40% faster.


Comments

  • zumalifeguard
    zumalifeguard almost 2 years

    I have a source tree, say c:\s, with many sub-folders. One of the sub-folders is called "c:\s\Includes" which can contain one or more .cs files recursively.

    I want to make sure that none of the .cs files in the c:\s\Includes... path exist in any other folder under c:\s, recursively.

    I wrote the following PowerShell script which works, but I'm not sure if there's an easier way to do it. I've had less than 24 hours experience with PowerShell so I have a feeling there's a better way.

    I can assume at least PowerShell 3 being used.

    I will accept any answer that improves my script, but I'll wait a few days before accepting the answer. When I say "improve", I mean it makes it shorter, more elegant or with better performance.

    Any help from anyone would be greatly appreciated.

    The current code:

    $excludeFolder = "Includes"

    $h = @{}
    foreach ($i in ls $pwd.path *.cs -r -file |
             ? DirectoryName -notlike ("*\" + $excludeFolder + "\*")) {
        $h[$i.Name] = $i.DirectoryName
    }
    ls ($pwd.path + "\" + $excludeFolder) *.cs -r -file |
        ? { $h.Contains($_.Name) } |
        Select @{ Name = "Duplicate"; Expression = { $h[$_.Name] + " has file with same name as " + $_.Fullname } }
    
  • zumalifeguard
    zumalifeguard over 9 years
    I can't comment on whether it works or not, but when I run it on my source tree (about a thousand source files), the script just keeps running and doesn't return. I'm not sure what it's doing; it's been a few minutes. The original script runs in a few seconds, and Ansgar's improvements make it even faster. Performance is not a hard requirement for this script, but it shouldn't be orders of magnitude slower. I'd be curious what data set you tested it on -- I can try to run it against that to see if I can figure out what's wrong with it.
  • Matt
    Matt over 9 years
    @zumalifeguard didn't really test it for performance... wonder why it is taking so long. My test base was about 20 txt files, 6 of which would have been registered as duplicates. Will expand to hundreds and see what happens. Thanks for letting me know.
  • zumalifeguard
    zumalifeguard over 9 years
    Your short one works. It's kinda cool how short it is. But you know, it's about 10 times slower than Ansgar's version. I mean, it's not bad, but the one with the dictionaries takes a split second, whereas this version takes about 5 seconds.
  • zumalifeguard
    zumalifeguard over 9 years
    Your longer one at the end I don't think does the right thing. I didn't debug it, but when I run it, it shows clashes that are not for .cs files. Also, I don't see ".cs" referenced anywhere. Are you looking for all clashes? 'Cause it's catching .dlls and stuff. But thanks for doing it anyway.
  • TessellatingHeckler
    TessellatingHeckler over 9 years
    In the second one: whoops yes I was -Including * in my tests instead of *.cs files and missed editing it when posting; I've updated the post to correct that. On my first script, that's interesting about the speed comparison. Would you try changing the gci -I *.cs to -Filter *.cs and see what difference that makes? (I expect it's faster, but I held off because -Filter seems to return wrong results sometimes).
  • TessellatingHeckler
    TessellatingHeckler over 9 years
    @zumalifeguard I've fixed that problem (I did miss changing * to *.cs when testing code 2). Code 2 is slow; it was just quick to write and simple in structure. Added a third, hashtable one-liner as well.
  • zumalifeguard
    zumalifeguard over 9 years
    I tried your updated solution #2, and it works. It takes about 7 seconds to run vs. the original, which runs in under a second.
  • zumalifeguard
    zumalifeguard over 9 years
    I tried your latest one (3) and it works. I like the terseness. It's fast, but I do have to point out that unlike the original solution, this one has to scan the entire list twice. That's not terrible, but not really ideal. What you could do is, when you're scanning over the list the first time and adding values to the hashmap, check whether there's already an entry there; if there is, add that name to some other list. That way, when you're done with the first scan, you have a short list of conflicting names that you can scan through.
  • zumalifeguard
    zumalifeguard over 9 years
    I also liked your use of "%" and the clever idea of using "-gt1". This would be an acceptable answer if it weren't for Ansgar's being slightly better. I'm realizing that performance is a key issue for this script and would have emphasized that in the original question had I realized it.
  • zumalifeguard
    zumalifeguard over 9 years
    Why does script 3 use += instead of =?
  • TessellatingHeckler
    TessellatingHeckler over 9 years
    @zumalifeguard : I've edited, adding an explanation of code 3 including why it uses +=. It doesn't exactly scan the entire list twice, although if there are 99% unique filenames and 1% duplicates it approximately does. I like the idea of keeping a separate list of duplicates.
  • zumalifeguard
    zumalifeguard over 9 years
    Thank you. When you say "isn't going over every file for a second time, it's going over every array in the hashtable": the number of arrays in the hashtable is the number of files minus duplicates, and the duplicates ought to be minimal. So if I had a disk with 10,000 files, the script goes over the 10,000 files once to collect them all. Let's say it found 50 duplicates, so the hashtable now contains 9,950 elements, which it has to rescan. Ideally, since you encountered those 50 elements once already, you keep track of them, so the second loop only iterates 50 times.
  • zumalifeguard
    zumalifeguard over 9 years
    I like your final solution. I also like the fact that you're engaged in this question and spent a considerable amount of time tweaking it, and I learned a lot from you. Given all of that, and actually delivering a terse solution, this is the best answer.
  • TessellatingHeckler
    TessellatingHeckler over 9 years
    @zumalifeguard : Thanks. There's now an edit with a version 4 (also barely tested), but with the "keeping a separate list" adjustment. It does $s[$_.Name]=1 but it's not a counter, it's using 1 as a flag (ought to use $true for sense, but 1 is shorter) and it's a hashtable instead of a list because a list would end up with each duplicate filename N times over. HashTable.Keys will only have them once.
  • TessellatingHeckler
    TessellatingHeckler over 9 years
    I like that approach of using the include filter; would it work having both -include and -exclude in the first script, and no 'where-object' section? It would pretty much be ls 'c:\s' -Filt *.cs -File -R -Ex 'c:\s\includes' -I ([string[]](ls 'c:\s\includes' -File -R *.cs))|%FullName. A quick test looks like that works; it's way better than my answers.
  • kevmar
    kevmar over 9 years
    -include and -exclude are deceiving. They filter on the filename only and don't look at the path of the file.
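
To illustrate kevmar's point above with a hedged sketch (not from the thread): -Include and -Exclude are matched against the leaf name of each item rather than its full path, so excluding a folder path that way silently does nothing, and a path test in Where-Object is needed instead.

# -Exclude compares its pattern against names like 'someFile.cs', so a folder
# path such as 'c:\s\includes' never matches and nothing gets excluded.
Get-ChildItem 'c:\s' -Recurse -File -Include '*.cs' -Exclude 'c:\s\includes'

# Filtering on the full path does what was intended:
Get-ChildItem 'c:\s' -Recurse -File -Filter '*.cs' |
    Where-Object { $_.FullName -notlike 'c:\s\includes\*' }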