Sort very large text file in PowerShell
Solution 1
Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.
Let's set up a baseline:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file of 1.6 million lines (100k unique lines, each repeated 16 times), this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which will remove duplicates, then copy the data into a List and sort it there, then use a StreamWriter to write the results back.
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        [void]$hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
try
{
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
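If the duplicate lines must be kept (a plain sort with no de-duplication), the same streaming approach works with the List alone and no HashSet. A minimal sketch under the same assumptions; the paths are placeholders:

```powershell
# Sort only, keeping duplicate lines: read straight into a List, no HashSet.
# Paths are placeholders; adjust to your files.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$ls = New-Object System.Collections.Generic.List[string]
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $ls.Add($line)
    }
}
finally {
    $reader.Close()
}
$ls.Sort()
[System.IO.File]::WriteAllLines("D:\result3.txt", $ls)
$sw.Stop()
Write-Output ("sort keeping duplicates took {0}" -f $sw.Elapsed)
```

Note that List.Sort uses the default string comparer, so the resulting order may differ slightly from what Sort-Object produces.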
Solution 2
I've grown to hate this part of Windows PowerShell; it is a memory hog with these larger files. One trick is to stream the lines:
[System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
Another trick, seriously, is to just use Linux:
cat file.txt | sort -u > output.txt
Linux is so insanely fast at this, it makes me wonder what the heck Microsoft is thinking with this setup.
It may not be feasible in all cases, and I understand that, but if you have a Linux machine, you can copy 500 MB to it, sort and unique it, and copy it back in under a couple of minutes.
Predrag Vasić
Updated on July 09, 2022
Comments
-
Predrag Vasić almost 2 years
I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date, yyyy-MM-dd hh:mm:ss, so no treatment is necessary for sorting).
The simplest and most obvious thing that comes to mind is
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this using Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using that? Thanks to anyone who might have a more efficient idea.
[edit]
I tried this subsequently, and it took a very long time; some 10 minutes for a 400 MB file.
-
n0rd over 8 years
Sorting one large chunk is not slower than sorting several smaller chunks, provided all the data fits into memory (i.e., nothing spills to swap).
-
E.Z. Hart over 8 years
@n0rd - it would depend on the size of the file, how much memory the machine has available, the algorithm Sort-Object uses, and how close to sorted the data is beforehand.
-
E.Z. Hart over 8 years
Give Measure-Command a try: technet.microsoft.com/en-us/library/…
-
n0rd over 8 years
On the same input data, sorting the whole set would never be slower than sorting chunks with the same algorithm and then merging. For external sorting (when all the data does not fit into memory), yes, you have to split, sort, and merge. Otherwise there is no gain in doing so.
-
n0rd over 8 years
Revision: the above is true for any decent (O(n log n) time complexity) sorting algorithm (otherwise it could be sped up by splitting, sorting, and merging), but not true for worse algorithms. I am pretty sure Sort-Object uses something decent. Pushing data through the pipeline may contribute a lot to execution time, though.
-
E.Z. Hart over 8 years
I'll update my answer to be more clear about the (potential) problem it's fixing.
-
Predrag Vasić over 8 years
This is a significant performance improvement; however, the target file is noticeably smaller than the source. Duplicate entries seem to be deleted, which I don't want. All I need it to do is sort the lines alphabetically; if there are multiple identical lines, keep them all. Thanks for the help!
-
n0rd over 8 years
Your sample code called Get-Unique, which removes duplicates. If you don't need that, then just read directly into a List and sort; there is no need for a HashSet here.
-
Jakub P over 6 years
Perhaps the read would be faster if the file were read as a whole, not line by line.
-
n0rd over 6 years
@JakubP, I highly doubt that. Breaking into lines has to happen at some point, either while reading from disk or while reading from memory, and I expect buffering makes the difference between the two negligible.
-
Carsten almost 4 years
Of course it is not as fast as a StreamReader or [System.IO.File]::OpenText, but on the other hand it does not create any peak load on the file system when used in blocks.
-
Nawar almost 4 years
This is not PowerShell. It is C#.
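A footnote on the timing harness: one of the comments above suggests Measure-Command, which can replace the manual Stopwatch boilerplate used in the answers. A minimal sketch, with placeholder file names:

```powershell
# Measure-Command runs a script block and returns a System.TimeSpan.
# File names are placeholders for illustration.
$t = Measure-Command {
    [System.IO.File]::ReadLines('.\unsorted.txt') |
        Sort-Object |
        Out-File '.\sorted.txt' -Encoding ascii
}
Write-Output ("pipeline took {0}" -f $t)
```

Note that Measure-Command swallows the pipeline output of the block, so redirect results to a file inside the block as shown.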