PowerShell get number of lines of big (large) file

Solution 1

Use Get-Content -ReadCount $nLinesAtATime (abbreviated to -read below) to read your file part by part:

$nlines = 0;

# Read the file 1000 lines at a time; each $_ is an array of up to 1000 lines,
# so its .Length adds the size of that chunk
gc $YOURFILE -read 1000 | % { $nlines += $_.Length };
[string]::Format("{0} has {1} lines", $YOURFILE, $nlines)

And here is a simple, but slow, script to validate the result on a small file:

gc $YOURFILE | Measure-Object -Line

Solution 2

Here's a PowerShell script I cobbled together which demonstrates a few different methods of counting lines in a text file, along with the time and memory required for each method. The results (below) show clear differences in the time and memory requirements. For my tests, it looks like the sweet spot was Get-Content, using a ReadCount setting of 100. The other tests required significantly more time and/or memory usage.

#$testFile = 'C:\test_small.csv' # 245 lines, 150 KB
#$testFile = 'C:\test_medium.csv' # 95,365 lines, 104 MB
$testFile = 'C:\test_large.csv' # 285,776 lines, 308 MB

# Using an ArrayList because it is faster than PowerShell arrays for some operations on large arrays.
$results = New-Object System.Collections.ArrayList

function AddResult {
param( [string] $sMethod, [string] $iCount )
    $result = New-Object -TypeName PSObject -Property @{
        "Method" = $sMethod
        "Count" = $iCount
        "Elapsed Time" = ((Get-Date) - $dtStart)
        "Memory Total" = [System.Math]::Round((GetMemoryUsage)/1mb, 1)
        "Memory Delta" = [System.Math]::Round(((GetMemoryUsage) - $dMemStart)/1mb, 1)
    }
    [void]$results.Add($result)
    Write-Output "$sMethod : $iCount"
    # Force a collection so the next test starts from a clean memory baseline
    [System.GC]::Collect()
}

function GetMemoryUsage {
    # return ((Get-Process -Id $pid).PrivateMemorySize)
    return ([System.GC]::GetTotalMemory($false))
}

# Get-Content -ReadCount 1
[System.GC]::Collect()
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
Get-Content -Path $testFile -ReadCount 1 |% { $count++ }
AddResult "Get-Content -ReadCount 1" $count

# Get-Content -ReadCount 10,100,1000,0
# Note: ReadCount = 1 returns a string.  Any other value returns an array of strings.
# Thus, the Count property only applies when ReadCount is not 1.
@(10,100,1000,0) |% {
    $dMemStart = GetMemoryUsage
    $dtStart = Get-Date
    $count = 0
    Get-Content -Path $testFile -ReadCount $_ |% { $count += $_.Count }
    AddResult "Get-Content -ReadCount $_" $count
}

# Get-Content | Measure-Object
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1 | Measure-Object -line).Lines
AddResult "Get-Content -ReadCount 1 | Measure-Object" $count

# Get-Content.Count
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1).Count
AddResult "Get-Content.Count" $count

# StreamReader.ReadLine
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
# Use this constructor (with FileShare.ReadWrite) to avoid file-access errors, just as Get-Content does.
$stream = New-Object -TypeName System.IO.FileStream(
    $testFile,
    [System.IO.FileMode]::Open,
    [System.IO.FileAccess]::Read,
    [System.IO.FileShare]::ReadWrite)
if ($stream) {
    $reader = New-Object IO.StreamReader $stream
    if ($reader) {
        while(-not ($reader.EndOfStream)) { [void]$reader.ReadLine(); $count++ }
        $reader.Close()
    }
    $stream.Close()
}

AddResult "StreamReader.ReadLine" $count

$results | Select Method, Count, "Elapsed Time", "Memory Total", "Memory Delta" | ft -auto | Write-Output

Here are the results for a text file containing ~95k lines (104 MB):

Method                                    Count Elapsed Time     Memory Total Memory Delta
------                                    ----- ------------     ------------ ------------
Get-Content -ReadCount 1                  95365 00:00:11.1451841         45.8          0.2
Get-Content -ReadCount 10                 95365 00:00:02.9015023         47.3          1.7
Get-Content -ReadCount 100                95365 00:00:01.4522507         59.9         14.3
Get-Content -ReadCount 1000               95365 00:00:01.1539634         75.4         29.7
Get-Content -ReadCount 0                  95365 00:00:01.3888746          346        300.4
Get-Content -ReadCount 1 | Measure-Object 95365 00:00:08.6867159         46.2          0.6
Get-Content.Count                         95365 00:00:03.0574433        465.8        420.1
StreamReader.ReadLine                     95365 00:00:02.5740262         46.2          0.6

Here are the results for a larger file (~285k lines, 308 MB):

Method                                    Count  Elapsed Time     Memory Total Memory Delta
------                                    -----  ------------     ------------ ------------
Get-Content -ReadCount 1                  285776 00:00:36.2280995         46.3          0.8
Get-Content -ReadCount 10                 285776 00:00:06.3486006         46.3          0.7
Get-Content -ReadCount 100                285776 00:00:03.1590055         55.1          9.5
Get-Content -ReadCount 1000               285776 00:00:02.8381262         88.1         42.4
Get-Content -ReadCount 0                  285776 00:00:29.4240734        894.5        848.8
Get-Content -ReadCount 1 | Measure-Object 285776 00:00:32.7905971         46.5          0.9
Get-Content.Count                         285776 00:00:28.4504388       1219.8       1174.2
StreamReader.ReadLine                     285776 00:00:20.4495721           46          0.4

Solution 3

Here is a one-liner based on Pseudothink's post.

Rows in one specific file:

"the_name_of_your_file.txt" |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"}

All files in the current directory (counted individually):

Get-ChildItem "." |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"}

Explanation:

"the_name_of_your_file.txt" -> does nothing by itself; it just provides the filename for the next steps, and needs to be quoted
|% -> alias for ForEach-Object; iterates over the items piped in (just one here), with the current item available as $_
$n = $_ -> saves the filename in $n for later use (strictly speaking this may not be needed)
$c = 0 -> initialises $c as the line count
Get-Content -Path $_ -ReadCount 1000 -> reads the file 1000 lines at a time (see the other answers in this thread)
|% -> for each chunk read, adds the number of rows actually read to $c (e.g. 1000 + 1000 + 123)
"$n; $c" -> once the file has been read, prints "name of the file; count of rows"
Get-ChildItem "." -> simply feeds more items into the pipeline than the single filename did
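Note that if the current directory also contains subdirectories, Get-Content will report errors for them. A minimal variant of the same one-liner that restricts itself to files and passes full paths (assuming PowerShell 3.0+ for the -File switch):

Get-ChildItem "." -File |% {$n = $_.Name; $c = 0; Get-Content -Path $_.FullName -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"}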

Solution 4

The first thing to try is to stream Get-Content and build up the line count one at a time, rather than storing all the lines in an array at once. I think this will give proper streaming behavior - i.e. the entire file will not be in memory at once, just the current line.

$lines = 0
Get-Content .\File.txt |%{ $lines++ }

And as the other answer suggests, adding -ReadCount could speed this up.
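For example, a minimal sketch of the same loop with -ReadCount added (1000 is an arbitrary batch size):

$lines = 0
# Each pipeline object is now an array of up to 1000 lines, so add the chunk size
Get-Content .\File.txt -ReadCount 1000 |%{ $lines += $_.Count }
$lines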

If that doesn't work for you (too slow or too much memory) you could go directly to a StreamReader:

$count = 0
$reader = New-Object IO.StreamReader 'c:\logs\MyLog.txt'
while($reader.ReadLine() -ne $null){ $count++ }
$reader.Close()  # Don't forget to do this. Ideally put this in a try/finally block to make sure it happens.
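A sketch of the same loop with the cleanup made robust via try/finally (the log path is just the placeholder from above):

$count = 0
$reader = New-Object IO.StreamReader 'c:\logs\MyLog.txt'
try {
    while($reader.ReadLine() -ne $null){ $count++ }
}
finally {
    # Runs even if the loop throws, so the file handle is always released
    $reader.Close()
}
$count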

Solution 5

Here's another solution that uses .NET:

[Linq.Enumerable]::Count([System.IO.File]::ReadLines("FileToCount.txt"))

It's not very interruptible, but it's very easy on memory.
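If you need this in more than one place, you could wrap it in a small helper; here is a sketch (Get-LineCount is a hypothetical name, not a built-in cmdlet). It resolves the path first, because ReadLines resolves relative paths against the process working directory rather than PowerShell's current location:

function Get-LineCount {
    param([Parameter(Mandatory)][string] $Path)
    # Resolve to a full path; .NET APIs do not honour PowerShell's current location
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    # ReadLines streams lazily, so only one line is held in memory at a time
    return [Linq.Enumerable]::Count([System.IO.File]::ReadLines($fullPath))
}

Get-LineCount 'FileToCount.txt'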


Comments

  • Pranav almost 2 years

    One of the ways to get the number of lines in a file is this method in PowerShell:

    PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a=Get-Content .\sub.ps1
    PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a.count
    34
    PS C:\Users\Pranav\Desktop\PS_Test_Scripts> 
    

    However, when I have a large 800 MB text file, how do I get its line count without reading the whole file?

    The above method consumes too much RAM, which crashes the script or makes it take too long to complete.

  • Fares over 8 years
    Using the IO.StreamReader code above fixed the out-of-memory errors I was getting with the gc method. I can confirm that it consumes a lot less memory (using PowerShell 5.0.10514.6).
  • Vladislav about 7 years
    It is worth pointing out that your second approach counts only the lines with text; empty lines are not counted.
  • Adarsha over 5 years
    With the latest version of [System.IO.File] (i.e. .NET 4.0 or later) this is the most efficient approach, as it does a lazy read and does not create huge objects, since it never needs to read more than one line at a time.
  • reggaeguitar about 5 years
    I had to use the full file path as the argument to ReadLines; a relative path didn't seem to work.
  • João Ciocca over 4 years
    Tested with a 2 GB file: ~34 seconds to count 3654346 lines. Thank you, sir!
  • Mark about 4 years
    @João Ciocca If you get a minute, try using $count = 0; switch -File $filepath { default { ++$count } } ... in some of my tests it is faster (see the formatted sketch after these comments). :)
  • Farbod Ahmadian almost 4 years
    Please provide more detail for further information.
  • dburtsev almost 4 years
    $fileName = 'C:\dirname\filename.txt'
    CMD /C ('find /v /c "" "' + $fileName + '"')
    CMD starts a new instance of the command interpreter, Cmd.exe, from the PowerShell script. /C carries out the command specified by the string and then stops. MS-DOS FIND is a filter command that searches for a string of characters in the files you name. Syntax: FIND [/V][/C][/I][/N] string [d:][path]filename[...]. /C displays only a count of the matching lines in each file; combined with /V "" it counts the lines that do not contain the empty string, which in this case gives the total number of lines.