Split one file into multiple files based on pattern (cut can occur within lines)

python perl awk split gnu

15,010

Solution 1

This performs the split without reading everything into RAM:

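def files():
    """Yield a new output file (1.part, 2.part, ...) on each request."""
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


filename = 'in.txt'  # placeholder: path of the file to split
pat = '<?xml'
fs = files()
outfile = next(fs)

with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # Text before the first pattern stays in the current file;
            # every later piece starts a new file, with the pattern restored.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()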

A word of warning: this doesn't work if your pattern spreads across multiple lines (that is, contains "\n"). Consider the mmap solution if this is the case.
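If the pattern may straddle a read boundary (or the file has very few newlines, so "line by line" still means huge reads), one possible workaround is to read fixed-size chunks and hold back the last few bytes of each chunk. The following is only a sketch under assumptions: the function name `split_stream`, the `chunk_size` parameter and the `N.part` naming are illustrative choices, and the input is read in binary mode.

```python
import os


def split_stream(infile, out_dir, pat=b"<?xml", chunk_size=1 << 20):
    """Split a binary stream just before every occurrence of pat.

    Returns the paths of the part files, in the order they were written.
    """
    paths = []
    out = None
    tail = b""  # bytes held back because they may begin a pattern

    def new_part():
        nonlocal out
        if out is not None:
            out.close()
        paths.append(os.path.join(out_dir, "%d.part" % (len(paths) + 1)))
        out = open(paths[-1], "wb")

    def write(data):
        if data:
            if out is None:  # data found before the first pattern
                new_part()
            out.write(data)

    while True:
        chunk = infile.read(chunk_size)
        buf = tail + chunk
        pos = 0  # start of the bytes in buf that are not yet written
        idx = buf.find(pat)
        while idx != -1:
            write(buf[pos:idx])  # finish the current part
            new_part()           # the match itself opens the next part
            pos = idx
            idx = buf.find(pat, idx + 1)
        # At EOF flush everything; otherwise keep the last len(pat) - 1
        # bytes, which could be the first half of a straddling pattern.
        keep = len(pat) - 1 if chunk else 0
        limit = max(pos, len(buf) - keep)
        write(buf[pos:limit])
        tail = buf[limit:]
        if not chunk:
            break

    if out is not None:
        out.close()
    return paths
```

Called as `split_stream(open(filename, 'rb'), '/output/dir')`, this keeps at most one chunk plus a few bytes in memory. Because a part file is only opened when there is something to write, it should not create an empty first file when the input starts with the pattern.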

Solution 2

Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):

perl -n -E 'if (/(.*)(<\?xml.*)/ ) {
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

perl -n : The -n flag loops over your file line by line (setting $_ to each line in turn).

-E : Execute the following text as code (Perl expects a filename by default).

if (/(.*)(<\?xml.*)/ ) : If a line matches <?xml, split that line (using regex captures) into $1 and $2.

print $fh $1 if $1 : Print the start of the line to the old file.

open $fh, ">output." . ++$i; : Create a new file-handle for writing.

print $fh $2 : Print the rest of the line to the new file.

} else { print $fh $_ } : If the line didn't match <?xml, just print it to the current file-handle.

Note: this script assumes your input file starts with <?xml.
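One more property of the regex is worth knowing: the first group `(.*)` is greedy, so a line that contains several `<?xml` markers is cut only at the last one. The small Python sketch below uses the same regex to show this, next to the `str.split` approach from Solution 1, which cuts at every marker. The function names and the sample line are made up for illustration.

```python
import re

PAT = "<?xml"


def split_like_perl(line):
    """Apply the same regex: greedy (.*) leaves only the LAST marker in group 2."""
    m = re.match(r"(.*)(<\?xml.*)", line)
    return (m.group(1), m.group(2)) if m else None


def split_every_marker(line):
    """Cut before every marker, as Solution 1 does with str.split."""
    head, *rest = line.split(PAT)
    return [head] + [PAT + item for item in rest]


line = "a<?xml 1>b<?xml 2>c"
print(split_like_perl(line))     # ('a<?xml 1>b', '<?xml 2>c')
print(split_every_marker(line))  # ['a', '<?xml 1>b', '<?xml 2>c']
```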

Solution 3

For files of that size, you'll probably want to use the mmap module, so you don't have to handle chunking up the file yourself. From the docs there:

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.

Here's a quick example that shows you how to find each occurrence of <?xml #> in the file. You can write the chunks to new files as you go, but I haven't written that part.

import mmap
import re

# a regex to match the "xml" nodes; mmap exposes bytes, so the pattern is bytes
r = re.compile(rb'<\?xml\s\d+>')

with open('so.txt', 'r+b') as f:
    mp = mmap.mmap(f.fileno(), 0)
    for m in r.finditer(mp):
        # here you can start collecting the starting positions and
        # writing chunks to new files
        print(m.start())
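The part left unwritten above, writing the chunks to new files, could look roughly like this. It is a sketch rather than a tested tool for 6GB inputs: the function name `split_with_mmap`, the `N.part` naming and the looser default pattern (just `<?xml`) are assumptions, and each part is copied into memory once while it is written.

```python
import mmap
import os
import re


def split_with_mmap(path, out_dir, pattern=rb"<\?xml"):
    """Write each span that starts at a pattern match to its own file.

    Note: mmap cannot map an empty file, so path must not be empty.
    """
    paths = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
        # Cut points: the start of every match, plus 0 (to keep any
        # data before the first match) and the end of the file.
        cuts = [m.start() for m in re.finditer(pattern, mp)]
        if not cuts or cuts[0] != 0:
            cuts.insert(0, 0)
        cuts.append(len(mp))
        for n, (start, end) in enumerate(zip(cuts, cuts[1:]), 1):
            out_path = os.path.join(out_dir, "%d.part" % n)
            with open(out_path, "wb") as out:
                out.write(mp[start:end])  # copies this one span
            paths.append(out_path)
    return paths
```

Since the whole file is searched as one byte sequence, this variant does not care whether the pattern sits inside a line or spans a newline.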

LostInTranslation

Updated on August 02, 2022

Comments

LostInTranslation over 1 year
Many solutions exist, but my specific requirement is to split within a line: the cut should occur just before the pattern. Example:

Infile:
```
<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>
```
With the pattern <?xml, this should become:

Outfile1:
```
<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>
```
Outfile2:
```
<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>
```
Outfile3:
```
<?xml 2><blabla><blabla>
```
Actually, the Perl script in the accepted answer here works fine for my little example, but it generates an error for my bigger (about 6GB) actual files. The error is:
```
panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.
```
I don't have permission to comment, which is why I started a new post. Finally, a Python solution would be even more appreciated, as I understand Python better.
alphaGeek almost 5 years

Not a Perl solution, but this awk-based one does the job: `awk -v RS="<?xml" '{print RS $0 > "Outfile"(NR-1)}' Infile`.
John Saunders over 11 years

Please explain your answer. This answer appears in the "low quality posts" list.
Anuj Gupta over 11 years

I hope you're not suggesting my_xml_Text_string will contain a 6GB string?
LostInTranslation over 11 years

Easy to understand and pretty efficient. Thanks!
Joran Beasley over 11 years

yeah I guess my selective vision skipped that part of the memo :P
LostInTranslation over 11 years

OK. That's not my case. The only (minor) issue is that it creates a 1st empty file.
LostInTranslation over 11 years

I like this solution, it seems very clever. As I have to process it in a python program which does something with the split files, I guess I could give it a list of mmap instead of files. My only problem is it's a bit hard to handle, not so simple.