Efficiently remove the last two lines of an extremely large text file


Solution 1

I haven't tried this on a large file to see how fast it is, but it should be fairly quick.

To use the script to remove lines from the end of a file:

./shorten.py 2 large_file.txt

It seeks to the end of the file, checks that the last character is a newline, then reads backwards one character at a time until it has found one more newline than the number of lines to remove (three newlines when removing two lines), and truncates the file just after that point. The change is made in place.
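
To see why "three newlines" is the right count when removing two lines, here is a tiny self-contained check in Python (the sample data is made up purely for illustration):

# four lines, each ending with a newline
data = b'line1\nline2\nline3\nline4\n'
# counting newlines backwards from the end: the 1st ends 'line4', the 2nd ends
# 'line3' and the 3rd ends 'line2'; truncating just after that 3rd newline
# leaves only the first two lines
third_from_end = data.rfind(b'\n', 0, data.rfind(b'\n', 0, data.rfind(b'\n')))
assert data[:third_from_end + 1] == b'line1\nline2\n'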

Edit: I've added a Python 2.4 version at the bottom.

Here is a version for Python 2.5/2.6:

#!/usr/bin/env python2.5
from __future__ import with_statement
# also tested with Python 2.6

import os, sys

if len(sys.argv) != 3:
    print sys.argv[0] + ": Invalid number of arguments."
    print "Usage: " + sys.argv[0] + " linecount filename"
    print "to remove linecount lines from the end of the file"
    exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0

with open(file,'r+b') as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    while f.tell() > 0:
        f.seek(-1, os.SEEK_CUR)
        char = f.read(1)
        if char != '\n' and f.tell() == end:
            print "No change: file does not end with a newline"
            exit(1)
        if char == '\n':
            count += 1
        if count == number + 1:
            f.truncate()
            print "Removed " + str(number) + " lines from end of file"
            exit(0)
        f.seek(-1, os.SEEK_CUR)

if count < number + 1:
    print "No change: requested removal would leave empty file"
    exit(3)

Here's a Python 3 version:

#!/usr/bin/env python3.0

import os, sys

if len(sys.argv) != 3:
    print(sys.argv[0] + ": Invalid number of arguments.")
    print ("Usage: " + sys.argv[0] + " linecount filename")
    print ("to remove linecount lines from the end of the file")
    exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0

with open(file,'r+b', buffering=0) as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    while f.tell() > 0:
        f.seek(-1, os.SEEK_CUR)
        char = f.read(1)
        if char != b'\n' and f.tell() == end:
            print("No change: file does not end with a newline")
            exit(1)
        if char == b'\n':
            count += 1
        if count == number + 1:
            f.truncate()
            print("Removed " + str(number) + " lines from end of file")
            exit(0)
        f.seek(-1, os.SEEK_CUR)

if count < number + 1:
    print("No change: requested removal would leave empty file")
    exit(3)
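
If reading one byte at a time turns out to be slow (for example, on files with very long lines), the same newline-counting idea can be applied reading larger chunks. The following is only a rough sketch of that variant, not part of the answer above; the function name and chunk size are arbitrary, and it assumes the file ends with a newline, as the scripts above also require:

import os

def drop_last_lines(path, nlines, chunk=4096):
    # Remove the last `nlines` lines of `path` in place, assuming the file
    # ends with a newline. Returns True on success, False if the file has
    # fewer lines than requested (in which case it is left untouched).
    with open(path, 'r+b') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        needed = nlines + 1          # same counting rule as above
        found = 0
        while pos > 0:
            step = min(chunk, pos)
            pos -= step
            f.seek(pos)
            block = f.read(step)
            # scan the chunk from its end toward its start
            for i in range(len(block) - 1, -1, -1):
                if block[i:i+1] == b'\n':
                    found += 1
                    if found == needed:
                        f.truncate(pos + i + 1)   # keep that newline itself
                        return True
        return False

# e.g. drop_last_lines('large_file.txt', 2)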

Here is a Python 2.4 version:

#!/usr/bin/env python2.4

import sys

if len(sys.argv) != 3:
    print sys.argv[0] + ": Invalid number of arguments."
    print "Usage: " + sys.argv[0] + " linecount filename"
    print "to remove linecount lines from the end of the file"
    sys.exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0
SEEK_CUR = 1
SEEK_END = 2

f = open(file,'r+b')
f.seek(0, SEEK_END)
end = f.tell()

while f.tell() > 0:
    f.seek(-1, SEEK_CUR)
    char = f.read(1)
    if char != '\n' and f.tell() == end:
        print "No change: file does not end with a newline"
        f.close()
        sys.exit(1)
    if char == '\n':
        count += 1
    if count == number + 1:
        f.truncate()
        print "Removed " + str(number) + " lines from end of file"
        f.close()
        sys.exit(0)
    f.seek(-1, SEEK_CUR)

if count < number + 1:
    print "No change: requested removal would leave empty file"
    f.close()
    sys.exit(3)

Solution 2

You can try GNU head:

head -n -2 file

Solution 3

I see my Debian Squeeze/testing systems (but not Lenny/stable) include a "truncate" command as part of the "coreutils" package.

With it you could simply do something like

truncate --size=-160 myfile

to remove 160 bytes from the end of the file (obviously you need to figure out exactly how many bytes you need to remove).
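
On a system without the coreutils truncate command (the CentOS case mentioned in the comments below), the same byte-level shrink can be done from Python 3.3 or later with os.truncate; the helper below is just a sketch and its name is made up:

import os

def shrink_by(path, nbytes):
    # Equivalent of `truncate --size=-NBYTES path`: cut `nbytes` bytes off
    # the end of the file, in place.
    os.truncate(path, os.path.getsize(path) - nbytes)

# e.g. shrink_by('myfile', 160)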

Solution 4

The problem with sed is that it is a stream editor -- it will process the entire file even if you only want to make modifications near the end. So no matter what, you are creating a new 400GB file, line by line. Any editor that operates on the whole file will probably have this problem.

If you know the number of lines, you could use head, but again this creates a new file instead of altering the existing one in place. It might be somewhat faster than sed simply because it does less work per line, but it still has to copy the entire file.

You might have better luck using split to break the file into smaller pieces, editing the last one, and then using cat to combine them again, but I'm not sure it will be any better. I would split by byte counts rather than by lines; otherwise it will probably be no faster at all -- you're still going to be creating a new 400GB file.

Solution 5

Try VIM... I'm not sure whether it will do the trick or not, as I've never used it on such a big file, but I've used it on smaller (though still fairly large) files in the past. Give it a try.



Comments

  • Russ Bradberry
    Russ Bradberry over 1 year

    I have a very large file (~400 GB), and I need to remove the last 2 lines from it. I tried to use sed, but it ran for hours before I gave up. Is there a quick way of doing this, or am I stuck with sed?

    • Admin
      Admin about 14 years
      you can give GNU head a try. head -n -2 file
    • Admin
      Admin about 14 years
      There were a couple of one line Perl and Java suggestions given in stackoverflow.com/questions/2580335/…
  • UNK
    UNK about 14 years
    I do believe vim only loads what's immediately around the buffer when editing; however, I've no idea how it saves.
  • Russ Bradberry
    Russ Bradberry about 14 years
    vim hangs while it tries to load the file
  • Russ Bradberry
    Russ Bradberry about 14 years
    It is pipe-delimited text; however, the last 2 lines are one column each, which will break my import, so I need them removed.
  • leeand00
    leeand00 about 14 years
    Well if it hangs, ah wait for it. Start it loading, go to work, come home, see if it is done.
  • leeand00
    leeand00 about 14 years
  • timday
    timday about 14 years
    Is fixing whatever does the "import" to deal with this case an option?
  • Russ Bradberry
    Russ Bradberry about 14 years
    No, the import is Infobright's "load data infile"
  • Russ Bradberry
    Russ Bradberry about 14 years
    Our system is running Python 2.4, and I'm not sure if any of our services rely on it. Will this work with that?
  • Russ Bradberry
    Russ Bradberry about 14 years
    I tried this, but it was going about the same speed as sed. It had written approx. 200 MB in 10 minutes; at this rate it would literally take hundreds of hours to complete.
  • Russ Bradberry
    Russ Bradberry about 14 years
    I'm using CentOS, so no I do not have truncate. However, this is exactly what I am looking for.
  • Dennis Williamson
    Dennis Williamson about 14 years
    @Russ: I've added a version for Python 2.4.
  • Russ Bradberry
    Russ Bradberry about 14 years
    absolutely amazing! worked like a charm and in less than a second!
  • Dennis Williamson
    Dennis Williamson about 14 years
    On my system, using a text file consisting of a million lines and over 57 MB, ed took 100 times as long to execute as my Python script. I can only imagine how much greater the difference would be for the OP's file, which is 7000 times bigger.
  • xiao
    xiao over 12 years
    It is the best solution since it is simple.
  • Daniel Andersson
    Daniel Andersson about 12 years
    @SooDesuNe: No it will print all lines from the beginning to 2 lines from the end, as per the manual. However, this would need to be redirected to a file, and then there is the problem with this file being giant, so it's not the perfect solution for this problem.
  • aefxx
    aefxx over 11 years
    +1 Why isn't this being accepted as the correct answer? It's fast, simple and does work as expected.
  • mreq
    mreq about 11 years
    @DanielAndersson Why not? You can head -n -2 file > output...
  • krlmlr
    krlmlr over 7 years
    tail is efficient for large files, too -- you can use tail | wc -c to compute the number of bytes to be trimmed.