Splitting a file in Linux based on content
Solution 1
If you have a mail.txt
$ cat mail.txt
<html>
mail A
</html>
<html>
mail B
</html>
<html>
mail C
</html>
Run csplit to split at every <html> line:
$ csplit mail.txt '/^<html>$/' '{*}'
- mail.txt => input file
- /^<html>$/ => pattern match every `<html>` line
- {*} => repeat the previous pattern as many times as possible
Check the output:
$ ls
mail.txt xx00 xx01 xx02 xx03
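Note that xx00 is empty here, because the very first line of mail.txt already matches the pattern. If you would rather have named, non-empty pieces, something along these lines should work with GNU csplit. This is only a sketch: the mail- prefix, the .txt suffix and the scratch directory are my own choices, and -z, -s and -b are GNU options that other csplit implementations may lack.

```shell
# Work in a scratch directory so nothing existing gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input with the same shape as the mail.txt shown above.
printf '%s\n' '<html>' 'mail A' '</html>' \
              '<html>' 'mail B' '</html>' \
              '<html>' 'mail C' '</html>' > mail.txt

# -z : drop empty pieces (such as the empty piece before the first <html>)
# -s : do not print the size of each piece
# -f : prefix for the output names
# -b : printf-style suffix, giving names like mail-00.txt
csplit -z -s -f 'mail-' -b '%02d.txt' mail.txt '/^<html>$/' '{*}'

ls
```

Concatenating the pieces in order (cat mail-*.txt) should give back the original file, which is a cheap way to check that nothing was lost.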
If you want to do it in awk (each output file is named after the line number of its <html> line):
$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt 4.txt 7.txt mail.txt
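For a big dump with thousands of mails, two things may bite with the one-liner: some awk implementations run out of file descriptors because every output file stays open, and awk complains if the first line is not a header. A variant that closes each file before opening the next and skips anything before the first header could look like this (a sketch; the mail-%05d.txt naming and the sample data are just assumptions for illustration):

```shell
# Work in a scratch directory so nothing existing gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input; the first line is deliberately not a header.
printf '%s\n' 'preamble line' \
              '<html>' 'mail A' '</html>' \
              '<html>' 'mail B' '</html>' \
              '<html>' 'mail C' '</html>' > mail.txt

awk '
  # On every header line: close the previous file, pick the next name.
  /^<html>$/ { if (out) close(out); out = sprintf("mail-%05d.txt", ++n) }
  # Print only once a header has been seen (out is empty before that).
  out        { print > out }
' mail.txt

ls
```

The files are numbered sequentially (mail-00001.txt, mail-00002.txt, ...) instead of by line number, so they sort in the order the mails appear.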
Solution 2
csplit is the best solution to this problem. Just thought I'd post a bash solution to show that there is no need to go perl on this task:
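#!/usr/bin/bash
MAIL='mail'          # path to huge mail-file
HEADER='^<html>$'    # regex matching the first line of each mail
# get line numbers of all header lines
# (the array is not called LINES, because bash itself may reset that variable)
mapfile -t starts < <(grep -n "$HEADER" "$MAIL" | cut -d: -f1)
file=0
for i in "${!starts[@]}"; do
    start=${starts[i]}
    # a mail ends right before the next header; the last one runs to end of file
    if (( i + 1 < ${#starts[@]} )); then
        end=$(( starts[i+1] - 1 ))
    else
        end='$'
    fi
    echo "$start, $end"
    sed -n "${start},${end}p" "$MAIL" > "${MAIL}${file}.txt"
    file=$((file+1))
done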
Solution 3
The csplit program solves your problem elegantly:
csplit "$FILE" '/<!DOCTYPE.*/' '{*}'
Solution 4
I agree with fge. With perl it would be a lot simpler. You can try something like this:
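#!/usr/bin/perl
use strict;
use warnings;

undef $/;            # slurp mode: read the whole input in one go
my $text = <>;
my $n = 0;
# split right before every header, so each piece starts with its header
for my $match (split /(?=HEADER_FORMAT)/, $text) {
    open(my $out, '>', 'mail' . ++$n) or die "cannot write mail$n: $!";
    print $out $match;
    close($out);
}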
Replace HEADER_FORMAT with your header type.
Solution 5
It is doable with some perl "magic"... Many people would call this ugly but here goes.
The trick is to replace $/ (the input record separator) with what you want and read your input, as such:
#!/usr/bin/perl -W
use strict;
my $i = 1;
$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF
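open INPUT, "<", "/path/to/inputfile" or die "cannot open input: $!";
while (my $mail = <INPUT>) {
    # strip the trailing separator; unlike substr/index, chomp leaves the
    # last record intact when it does not end with $/
    chomp $mail;
    open OUTPUT, ">", "/path/to/emailfile." . $i . ".txt" or die "cannot open output: $!";
    $i++;
    print OUTPUT $mail;
    close OUTPUT;
}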
Edit: fixed, I always forget that $/ is included in the input. Also, the first file will always be empty, but that is easy to handle.
Comments
-
Greenhorn almost 2 years
I have an email dump of around 400 MB. I want to split it into .txt files, with one mail in each file. Every e-mail starts with the standard HTML header specifying the doctype.
This means I will have to split the file on that header. How do I go about it in Linux?
-
fge over 12 years
Yep, a positive lookahead would work nicely, especially since here the header does not contain any metacharacter. You could even use qr// to build the split regex.
-
Greenhorn over 12 years
I'm afraid that didn't work! I did the same, and ls then showed only mail.txt and xx00, and obviously mail.txt was the same as xx00. Any fixes?
-
kev over 12 years
@Ramprakash My csplit's version is 8.5. Maybe yours doesn't support {*}, which repeats the pattern. Please check the man page. I just added an awk solution. You can try it.
-
Daniel Gasienica about 8 years
@Greenhorn My version of csplit also didn’t support {*}, but this worked:
csplit -n 6 -f 'mail-' -k mail.txt '/^<html>$/' '{5000}'
-
qwertzguy about 7 years
The arguments are in the wrong order, and the repetition needed to actually do what was intended is missing.
-
mwfearnley almost 5 years
To prevent an awk error if the first line doesn't match the pattern (for gawk at least), do:
awk 'BEGIN {filename="0.txt"} /...'
-
boutta about 2 years
In the seq command, I don't know why a step width of 2 was chosen. I changed it to 1 to make it work for me.