Editing the first/last lines of a 1GB+ text file on Windows without loading the entire file into memory


Solution 1

Removing content from the beginning of a file requires rewriting the file.

You can use tail -n +4 input.csv > output.csv to remove the first three lines (it took 105 seconds for a 15 GB Wikipedia dump on my low-end server, i.e. about 150 MB per second). On Windows, tail is available via Cygwin, for example.
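
If Cygwin isn't available, the same streaming rewrite can be done from PowerShell with plain .NET streams, reading and writing one line at a time so only a small buffer is ever in memory. A minimal sketch (the paths are placeholders, and it assumes you can afford a second copy of the file on disk):

    # Copy everything after the first 3 lines into a new file,
    # holding only one line in memory at a time.
    $reader = [System.IO.File]::OpenText("C:\data\input.csv")    # placeholder paths
    $writer = [System.IO.File]::CreateText("C:\data\output.csv")
    try {
        for ($i = 0; $i -lt 3; $i++) { [void]$reader.ReadLine() }   # skip the junk header
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine($line)
        }
        $writer.WriteLine()   # trailing empty line, as the question asks for
    } finally {
        $reader.Close()
        $writer.Close()
    }

This still rewrites the whole file (as noted above, that is unavoidable when removing leading content), but it never loads more than one line into memory.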

Solution 2

I guess there's no way to avoid reading the whole file into memory; at least I don't know of one.

gci "C:\location" -Filter *.csv | % {
    (Get-Content $_.FullName | select -Skip 3) | Set-Content $_.FullName
    Add-Content -Path $_.FullName -Value ""
}

This is a PowerShell solution, and it does require loading the whole file into memory. It will:

  • find every CSV in the given location with gci,
  • loop over the found CSV files with %, the foreach alias,
  • read their whole content (this can take some time) with Get-Content,
  • select everything except the first 3 lines with select -Skip 3,
  • write that content back to the file with Set-Content,
  • and, on the last line, append an empty line to the file with Add-Content.

Edit: You can try to make this whole thing faster by adding the -ReadCount parameter to your Get-Content call.

-ReadCount (int)

Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.

This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in very large items.

Edit 2: I tested Get-Content with -ReadCount. Sadly, I couldn't find a text file larger than 89 MB, but the difference is already significant:

PS C:\Windows\System32> Measure-Command { gc "C:\Pub.log" -readcount 0 }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 1
Milliseconds      : 22
Ticks             : 10224578
TotalDays         : 1.18340023148148E-05
TotalHours        : 0.000284016055555556
TotalMinutes      : 0.0170409633333333
TotalSeconds      : 1.0224578
TotalMilliseconds : 1022.4578

PS C:\Windows\System32> Measure-Command { gc "C:\Pub.log" -readcount 1 }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 10
Milliseconds      : 594
Ticks             : 105949457
TotalDays         : 0.000122626686342593
TotalHours        : 0.00294304047222222
TotalMinutes      : 0.176582428333333
TotalSeconds      : 10.5949457
TotalMilliseconds : 10594.9457

So Get-Content $_.FullName -ReadCount 0 is the way to go.
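
One caveat worth flagging here: with -ReadCount set to anything other than 1, Get-Content sends batches of lines down the pipeline as single array objects, so piping it straight into select -Skip 3 skips three batches rather than three lines; with -ReadCount 0 that single all-lines array gets skipped entirely and the file would be emptied. A sketch of one way to keep the speed and still skip line by line (same location as the code above; collecting into a variable first makes PowerShell enumerate the array when it is piped again):

    gci "C:\location" -Filter *.csv | % {
        # Collect the fast -ReadCount 0 result into a variable first;
        # piping the collected array then enumerates it line by line,
        # so -Skip 3 skips lines again instead of whole batches.
        $lines = Get-Content $_.FullName -ReadCount 0
        $lines | select -Skip 3 | Set-Content $_.FullName
        Add-Content -Path $_.FullName -Value ""
    }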


Comments

  • Wouter, almost 2 years ago

    I have some flat-text data files ("CSV") with sizes up to 3GB and simply need to remove the first 3 lines of text, and add an empty line at the end. Since I have a lot of these files, I would like to find a fast way of doing this.

    The problem with these first lines is that they are not CSV data, but random text that doesn't follow the column format. Because of this, SQL Server's Bulk Insert statement can't process these files.

    One option would be to use a PowerShell script, but using Get-Content or streams would always involve reading the entire file and writing it all out again. Is there a way to directly modify the file on disk, without loading it entirely into memory and recreating the file?

    Preferably, I'm looking for a PowerShell way to do this. Although third-party tools might also be interesting...

    • ganesh, over 7 years ago
      Not an answer to the question asked, but if you ever get to refactor: This is what a database does quite well.
    • Wouter, over 7 years ago
      @Hennes: That would work if the first 3 lines were actual data lines, but they are random text. Edited my question to make this clearer. I formulated it badly earlier...
    • ganesh, over 7 years ago
      Ah. I see plenty of solutions which involve reading the whole file (I searched for "trim beginning of a file"). It will be interesting to see what comes up for Windows that does not read the unchanged parts.
    • Wouter, over 7 years ago
      Editing by loading into RAM would of course be trivial :)
    • Ramhound, over 7 years ago
      Have you tried Get-Content along with Set-Content to get the first 3 lines, and/or BaseStream to read/replace 3 lines? Unless you read in the entire file, neither suggestion would result in the entire file being read into memory.
    • Neil McGuigan, over 7 years ago
      Put GNU/Linux in a virtual machine and run it there
    • Yorik, over 7 years ago
      You might be able to hex edit the 3 lines (or script a routine) into a proper row format with the proper number of field and record delimiters, and then use FIRSTROW. This would only require seeking a few bytes into the file (see the sketch after these comments).
  • LMiller7, over 7 years ago
    Editing file data in a way that does not change its length has been a standard file system function for a long time. Adding data to the end of a file is also directly supported. Removal of data from the end of a file is supported by most file systems. But any change elsewhere that alters the length of the file is more problematic, and I am not aware of any file system that directly supports it. It can be done with software, but it requires a lot of copying of file data, and that is going to be slow, particularly with large files.
  • Wouter, over 7 years ago
    Ah, very good to know! Now I know for sure I'll have to go with plan B :)
  • Wouter, over 7 years ago
    Not an answer to the question, but as it seems that it won't be possible without loading into memory, this is definitely the next best thing. Have an upvote :)
  • chingNotCHing, over 7 years ago
    Using -ReadCount 0 will definitely get an out-of-memory exception for a 3 GB file.
  • SimonS, over 7 years ago
    @chingNotCHing I just tried it with a 5.4 GB file and didn't get an out-of-memory exception (I have 8 GB of memory in my system). Memory usage was at 98%, which isn't really good, you're right, and PowerShell holds on to that memory until you close the session. It's not optimal like this, you're right. Change -ReadCount 0 to -ReadCount 100 or -ReadCount 1000 for a better outcome.
  • chingNotCHing, over 7 years ago
    You're lucky; I was working on 32-bit Windows at the time.
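
Following up on Yorik's and LMiller7's points: since in-place overwrites that don't change the file length are cheap, you could blank out the junk header bytes directly and let BULK INSERT's FIRSTROW skip past those rows. A rough sketch under stated assumptions: the path is a placeholder, lines end in LF, and for a real schema you would pad with the proper number of field delimiters rather than plain commas:

    $path = "C:\data\input.csv"   # placeholder path
    $fs = [System.IO.File]::Open($path, 'Open', 'ReadWrite')
    try {
        # Find the byte offset just past the 3rd line break.
        $end = 0; $breaks = 0
        while ($breaks -lt 3) {
            $b = $fs.ReadByte()
            if ($b -eq -1) { break }       # file has fewer than 3 lines
            $end++
            if ($b -eq 10) { $breaks++ }   # LF
        }
        # Overwrite those bytes with commas, preserving CR/LF, so the
        # junk lines become parseable CSV records of empty fields.
        $fs.Position = 0
        for ($i = 0; $i -lt $end; $i++) {
            $b = $fs.ReadByte()
            if ($b -ne 10 -and $b -ne 13) {
                $fs.Position -= 1
                $fs.WriteByte([byte][char]',')
            }
        }
    } finally { $fs.Close() }

This only touches the first few bytes of the file, no matter how large it is, which is exactly the property the question asks for; the trade-off is that the junk rows stay in the file and must be skipped on every import.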