Sort very large text file in PowerShell


Solution 1

Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.

Let's set up a base line:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file of 1.6 million lines (100k unique lines, each repeated 16 times), this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663

Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates as we go, then copy the data into a List and sort it there, and finally use a StreamWriter to write the results back out.

$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        [void]$hs.Add($line)   # HashSet.Add returns $false for a duplicate line; discard the result
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
try
{
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
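
One thing that might be worth checking (an aside, not part of the original measurements) is the read buffer: StreamReader's default buffer is quite small, so a rough experiment is to rerun the same loop with an explicit buffer size and time it with Stopwatch exactly as before. The 1 MB value, the path and the encoding below are just placeholder assumptions:

# Same read-and-deduplicate loop as above, but with an explicit 1 MB read buffer
$hs = new-object System.Collections.Generic.HashSet[string]
$reader = new-object System.IO.StreamReader -ArgumentList "D:\log3.txt", ([System.Text.Encoding]::ASCII), $true, 1MB
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        [void]$hs.Add($line)
    }
}
finally {
    $reader.Close()
}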

Solution 2

I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to read the lines with [System.IO.File]::ReadLines:

[System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
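
If removing duplicates is acceptable (that is what sort -u does anyway), a minimal sketch of the same idea without pushing everything through the Sort-Object pipeline is to feed ReadLines into a SortedSet, which keeps the lines ordered and unique as they arrive. The paths are placeholders, and it still assumes the unique lines fit in memory:

$set = new-object System.Collections.Generic.SortedSet[string]
foreach ($line in [System.IO.File]::ReadLines('C:\logs\file.txt')) {
    [void]$set.Add($line)   # Add returns $false for a duplicate line; result discarded
}
[System.IO.File]::WriteAllLines('C:\logs\file2.txt', $set)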

Another trick, seriously, is to just use Linux.

cat file.txt | sort -u > output.txt

Linux is so insanely fast at this that it makes me wonder what the heck Microsoft is thinking with this setup.

It may not be feasible in all cases, and I understand that, but if you have a Linux machine, you can copy 500 MB to it, sort and unique it, and copy it back in under a couple of minutes.


Comments

  • Predrag Vasić
    Predrag Vasić almost 2 years

    I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date, yyyy-MM-dd hh:mm:ss, so no treatment is necessary for sorting).

    The simplest and most obvious thing that comes to mind is

     Get-Content unsorted.txt | sort | get-unique > sorted.txt
    

    I am guessing (without having tried it) that doing this using Get-Content would take forever on my 1GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it?

    Thanks to anyone who might have a more efficient idea.

    [edit]

    I tried this subsequently, and it took a very long time; some 10 minutes for 400MB.

  • n0rd
    n0rd over 8 years
    Sorting one large chunk is not slower than sorting several smaller chunks, provided all the data fits into memory (i.e. nothing spills to swap).
  • E.Z. Hart
    E.Z. Hart over 8 years
    @n0rd - it would depend on the size of the file, how much memory the machine has available, the algorithm Sort-Object uses, and how close to sorted the data is beforehand.
  • E.Z. Hart
    E.Z. Hart over 8 years
    Give Measure-Command a try: technet.microsoft.com/en-us/library/…
  • n0rd
    n0rd over 8 years
    On the same input data, sorting the whole set is never slower than sorting chunks with the same algorithm and then merging. For external sorting (when all the data does not fit into memory), yes, you have to split, sort and merge. Otherwise there is no gain in doing so.
  • n0rd
    n0rd over 8 years
    Revision: the above is true for any decent (O(n log n) time complexity) sorting algorithm (otherwise it could be sped up by splitting, sorting and merging), but not for worse algorithms. I am pretty sure Sort-Object uses something decent. Pushing data through the pipeline may contribute a lot to the execution time, though.
  • E.Z. Hart
    E.Z. Hart over 8 years
    I'll update my answer to be more clear about the (potential) problem it's fixing.
  • Predrag Vasić
    Predrag Vasić over 8 years
    This is a significant performance improvement; however, the target file is noticeably smaller than the source. Duplicate entries seem to be deleted, which I don't want it to do. All I need it to do is sort the lines alphabetically; if there are multiple identical lines, keep them all. Thanks for the help!
  • n0rd
    n0rd over 8 years
    Your sample code called Get-Unique, which removes duplicates. If you don't need that, then just read directly into a List and sort it; there is no need for a HashSet here (a sketch of that variant follows this comment list).
  • Jakub P
    Jakub P over 6 years
    Perhaps the file read improves if the file is read as a whole, not line by line.
  • n0rd
    n0rd over 6 years
    @JakubP, I highly doubt that. Breaking into lines has to happen at some point, either while reading from disk or while reading from memory, and I expect buffering will make the difference between the two negligible.
  • Carsten
    Carsten almost 4 years
    Of course it is not as fast as a StreamReader or [System.IO.File]::OpenText, but on the other hand it does not create any peak load on the file system when using it in blocks.
  • Nawar
    Nawar almost 4 years
    This is not PowerShell. It is C#.
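
Following up on n0rd's suggestion above, a minimal sketch of the duplicate-preserving variant (paths are placeholders, and it assumes the whole file fits in memory) reads the lines into a List, sorts it in place, and writes it back out:

$list = new-object System.Collections.Generic.List[string]
foreach ($line in [System.IO.File]::ReadLines('C:\logs\unsorted.txt')) {
    $list.Add($line)   # keep every line, including duplicates
}
$list.Sort()   # lines start with yyyy-MM-dd hh:mm:ss, so a plain string sort gives chronological order
[System.IO.File]::WriteAllLines('C:\logs\sorted.txt', $list)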