Split one file into multiple files based on pattern (cut can occur within lines)
Solution 1
This performs the split without reading everything into RAM:
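    def files():
        # Generator that opens the next numbered output file on demand.
        n = 0
        while True:
            n += 1
            yield open('/output/dir/%d.part' % n, 'w')

    pat = '<?xml'
    fs = files()
    outfile = next(fs)

    # filename is the path of the input file
    with open(filename) as infile:
        for line in infile:
            if pat not in line:
                outfile.write(line)
            else:
                # Cut the line before each occurrence of the pattern.
                items = line.split(pat)
                outfile.write(items[0])
                for item in items[1:]:
                    outfile.close()
                    outfile = next(fs)
                    outfile.write(pat + item)
    outfile.close()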
A word of warning: this doesn't work if your pattern spreads across multiple lines (that is, contains "\n"). Consider the mmap solution if this is the case.
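Reading line by line also means that a file with very long lines (or no newlines at all) still ends up largely in memory. As a rough sketch of one way around both limits, the function below reads fixed-size chunks and holds back the last len(pat) - 1 bytes so that a pattern straddling two chunks is still found. The name split_on_pattern, its parameters, and the N.part naming are assumptions made for illustration, not part of the answer above.

```python
import os


def split_on_pattern(in_path, out_dir, pat=b'<?xml', chunk_size=1 << 20):
    """Split in_path before every occurrence of pat; return the part paths."""
    keep = len(pat) - 1   # bytes held back in case pat straddles two chunks
    paths = []
    out = None            # current output file, opened lazily

    def new_part():
        # Hypothetical naming scheme: 1.part, 2.part, ...
        path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
        paths.append(path)
        return open(path, 'wb')

    buf = b''
    with open(in_path, 'rb') as infile:
        while True:
            chunk = infile.read(chunk_size)
            buf += chunk
            pos = buf.find(pat)
            while pos != -1:
                if pos > 0:
                    # Data before the match belongs to the current part.
                    if out is None:
                        out = new_part()
                    out.write(buf[:pos])
                if out is not None:
                    out.close()
                out = new_part()
                out.write(pat)
                buf = buf[pos + len(pat):]
                pos = buf.find(pat)
            # Flush what cannot be the start of a match; at EOF flush it all.
            cut = max(len(buf) - keep, 0) if chunk else len(buf)
            if cut:
                if out is None:
                    out = new_part()
                out.write(buf[:cut])
                buf = buf[cut:]
            if not chunk:
                break
    if out is not None:
        out.close()
    return paths
```

Calling split_on_pattern('in.txt', '/output/dir') would write one file per occurrence of the pattern. Because a part is only opened when there is something to write, an input that starts with the pattern does not produce an empty first file.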
Solution 2
Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):
    perl -n -E 'if (/(.*)(<\?xml.*)/ ) {
        print $fh $1 if $1;
        open $fh, ">output." . ++$i;
        print $fh $2;
    } else { print $fh $_ }' in.txt
perl -n
: The -n flag loops over your file line by line (setting the contents to $_).
-E
: Execute the following text as code (Perl expects a filename by default).
if (/(.*)(<\?xml.*)/ )
: If a line matches <?xml, split that line (using regex matches) into $1 and $2.
print $fh $1 if $1
: Print the start of the line to the old file.
open $fh, ">output." . ++$i;
: Create a new file handle for writing.
print $fh $2
: Print the rest of the line to the new file.
} else { print $fh $_ }
: If the line didn't match <?xml, just print it to the current file handle.
Note: this script assumes your input file starts with <?xml.
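One thing to watch in the regex above: (.*) is greedy, so a line containing several <?xml markers is cut only before the last one. Since the question asked for Python, here is a small sketch of cutting a line before every occurrence instead, using a zero-width lookahead with re.split. It assumes Python 3.7 or later (earlier versions do not split on empty matches), and split_before is a made-up helper name.

```python
import re


def split_before(line, pat='<?xml'):
    """Return the pieces of line, cut just before each occurrence of pat."""
    # The lookahead matches the position in front of pat without consuming
    # it, so every piece after the first keeps its leading pattern.
    pieces = re.split('(?=%s)' % re.escape(pat), line)
    # A line that starts with pat yields an empty first piece; drop it.
    return [p for p in pieces if p]


line = 'tail of previous<?xml 4> <blabla><?xml 2><blabla>'
print(split_before(line))
# ['tail of previous', '<?xml 4> <blabla>', '<?xml 2><blabla>']
```

The first piece would go to the file that is already open, and each later piece would start a new file.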
Solution 3
For files of that size, you'll probably want to use the mmap module, so you don't have to handle chunking up the file yourself. From the docs there:
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
Here's a quick example that shows you how to find each occurrence of <?xml #> in the file. You can write the chunks to new files as you go, but I haven't written that part.
    import mmap
    import re

    # a regex to match the "xml" nodes (a bytes pattern, since in Python 3
    # an mmap object exposes bytes)
    r = re.compile(rb'<\?xml\s\d+>')

    with open('so.txt', 'r+b') as f:
        mp = mmap.mmap(f.fileno(), 0)
        for m in r.finditer(mp):
            # here you can start collecting the starting positions and
            # writing chunks to new files
            print(m.start())
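The part left unwritten above, copying each chunk to its own file, might look roughly like the sketch below. It is one possible completion, not the answer author's code: the function name split_with_mmap, the partN.xml naming, and the block size are all assumptions, and it expects Python 3 and a non-empty input file (mmap refuses to map an empty one).

```python
import mmap
import os
import re


def split_with_mmap(in_path, out_dir, pattern=rb'<\?xml', block=1 << 20):
    """Write one file per match of pattern, each starting at the match."""
    paths = []
    with open(in_path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
            # Collect the offset of every match; only integers are kept.
            starts = [m.start() for m in re.finditer(pattern, mp)]
            # Data before the first match (if any) becomes its own part.
            if not starts or starts[0] != 0:
                starts.insert(0, 0)
            ends = starts[1:] + [len(mp)]
            for n, (start, end) in enumerate(zip(starts, ends), 1):
                # Hypothetical naming scheme: part1.xml, part2.xml, ...
                path = os.path.join(out_dir, 'part%d.xml' % n)
                with open(path, 'wb') as out:
                    # Copy in blocks so a huge chunk is never held at once.
                    for pos in range(start, end, block):
                        out.write(mp[pos:min(pos + block, end)])
                paths.append(path)
    return paths
```

The returned list of paths can be handed to whatever processes the split files next. Unlike the line-based approach, this works regardless of where newlines fall, because the pattern is searched over the whole mapped file.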
LostInTranslation
Updated on August 02, 2022
Comments
-
LostInTranslation over 1 year
A lot of solutions exist, but what is specific here is that I need to be able to split within a line: the cut should occur just before the pattern. Example:
Infile:
<?xml 1><blabla1> <blabla><blabla2><blabla> <blabla><blabla> <blabla><blabla3><blabla><blabla> <blabla><blabla><blabla><?xml 4> <blabla> <blabla><blabla><blabla> <blabla><?xml 2><blabla><blabla>
With the pattern <?xml, this should become:
Outfile1:
<?xml 1><blabla1> <blabla><blabla2><blabla> <blabla><blabla> <blabla><blabla3><blabla><blabla> <blabla><blabla><blabla>
Outfile2:
<?xml 4> <blabla> <blabla><blabla><blabla> <blabla>
Outfile3:
<?xml 2><blabla><blabla>
Actually, the perl script in the validated answer here works fine for my little example. But it generates an error for my bigger (about 6GB) actual files. The error is: panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.
I don't have the permissions to comment, that's why I started a new post. And finally, a Python solution would be even more appreciated, as I understand it better.
-
alphaGeek almost 5 years
Not a perl solution, but based on awk and does the job: `awk -v RS="<?xml" '{print RS $0 > "Outfile"(NR-1)}' Infile`.
-
John Saunders over 11 years
Please explain your answer. This answer appears in the "low quality posts" list.
-
Anuj Gupta over 11 years
I hope you're not suggesting my_xml_Text_string will contain a 6GB string?
-
LostInTranslation over 11 years
Easy to understand and pretty efficient. Thanks!
-
Joran Beasley over 11 years
yeah I guess my selective vision skipped that part of the memo :P
-
LostInTranslation over 11 years
OK. That's not my case. The only (minor) issue is that it creates a 1st empty file.
-
LostInTranslation over 11 years
I like this solution, it seems very clever. As I have to process it in a python program which does something with the split files, I guess I could give it a list of mmap instead of files. My only problem is it's a bit hard to handle, not so simple.